ENH: improve performance of df.to_csv GH3054 #3059

Merged: 27 commits, merged into pandas-dev:master on Mar 19, 2013

4 participants
Contributor

y-p commented Mar 15, 2013

Needs more testing before merging.

Following SO question mentioned in #3054:

In [7]: def df_to_csv(df,fname):
   ...:     fh=open(fname,'w')
   ...:     fh.write(','.join(df.columns) + '\n')
   ...:     for row in df.itertuples(index=False):
   ...:         slist = [str(x) for x in row]
   ...:         ss = ','.join(slist) + "\n"
   ...:         fh.write(ss)
   ...:     fh.close()
   ...: 
   ...: aa=pd.DataFrame({'A':range(100000)})
   ...: aa['B'] = aa.A + 1.0
   ...: aa['C'] = aa.A + 2.0
   ...: aa['D'] = aa.A + 3.0
   ...: 
   ...: %timeit -r10 aa.to_csv('/tmp/junk1',index=False)   
   ...: %timeit -r10 df_to_csv(aa,'/tmp/junk2') 
   ...: from hashlib import sha1
   ...: print sha1(open("/tmp/junk1").read()).hexdigest()
   ...: print sha1(open("/tmp/junk2").read()).hexdigest()
current pandas | with PR | example code
2.3 s          | 1.29 s  | 1.28 s

wins:

  • convert numpy numerics to native types to eliminate expensive numpy-specific
    stringify calls.
  • if the number of columns is < 10000, precompute the cols loop range rather
    than creating and walking a generator at each iteration of the inner loop.
  • some cargo-cult stuff that's probably in the noise.
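The first win can be seen in miniature (a sketch of ours, not the PR's code; the variable names are made up):

```python
import timeit

import numpy as np

arr = np.arange(100000, dtype="float64")

# str() on a numpy scalar routes through numpy's own formatting machinery
t_numpy = timeit.timeit(lambda: [str(x) for x in arr], number=1)

# tolist() converts the whole array to native Python floats in one C call,
# after which str() takes the plain CPython fast path
t_native = timeit.timeit(lambda: [str(x) for x in arr.tolist()], number=1)

print("numpy scalars: %.3fs, native floats: %.3fs" % (t_numpy, t_native))
```

The conversion itself is lossless for plain floats, so the output file is unchanged.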
Contributor

jreback commented Mar 15, 2013

Case 2

aa=pd.DataFrame({'A':range(100000),'B' : pd.Timestamp('20010101')})
aa.ix[100:2000,'A'] = np.nan

Current pandas

In [4]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 1.57 s per loop

case 2

In [7]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 2.66 s per loop

With change (starting with yours + my patch)

In [3]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 825 ms per loop

case 2

In [5]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 1.96 s per loop
Contributor

y-p commented Mar 15, 2013

Thanks Jeff. The patch was garbled by GH. Can you pull this branch, add your patch
as a new commit, and push to your fork? I'll pick up the commit and add it to the PR.

Contributor

y-p commented Mar 15, 2013

After jeff's commit:

current pandas | with PR | example code
2.3 s          | 0.58 s  | 1.28 s
Contributor

y-p commented Mar 15, 2013

L1389 sure looks like one loop too many.

Contributor

jreback commented Mar 15, 2013

I played around more with this, but all of the pre-allocation and array indexing seems to be working (even the map around asscalar makes a big diff), so unless we write this loop in cython, I'm not sure how much better we can do

Contributor

y-p commented Mar 16, 2013

Another 25%:

current pandas | with PR | example code
2.3 s          | 0.44 s  | 1.28 s
Contributor

jreback commented Mar 16, 2013

I get about 20% more with cython.....(on my test case that has lots of dates).....
jreback/pandas@5335a80
jreback/pandas@fa12e63

Contributor

jreback commented Mar 16, 2013

I think you win! (I get 0.47s with the same test, and 2.3s with current pandas) :(
hmm....though if I were to pre-allocate the rows.... I am using a single row and overwriting (and copying to the writer)

Contributor

y-p commented Mar 16, 2013

wait, I'm following your lead and cythonizing the latest python version. let's see...

Contributor

y-p commented Mar 16, 2013

Remember we're benchmarking on different machines, timings differ. I get 0.407 with your cython branch.

Contributor

y-p commented Mar 16, 2013

310ms

Contributor

jreback commented Mar 16, 2013

whoosh
it's the preallocation of rows
pretty good 8x speedup
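The preallocation trick in miniature (our sketch, not the actual cython):

```python
# reuse one preallocated row buffer instead of building a fresh list per row
ncols = 4
row = [None] * ncols        # allocated once, up front
out = []
for i in range(3):
    for j in range(ncols):
        row[j] = i * ncols + j   # overwrite in place; no per-row allocation
    out.append(",".join(map(str, row)))

print(out)  # ['0,1,2,3', '4,5,6,7', '8,9,10,11']
```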

Contributor

y-p commented Mar 16, 2013

hang on, I'm working on a linux kernel patch...

Contributor

y-p commented Mar 16, 2013

the list() cast in the call into cython eliminates some test failures that
occur when passing in the index ndarray; I have no idea why.

another 20% (relative):

current pandas | with PR | example code
2.3 s          | 0.35 s  | 1.28 s

any more tricks? probably heavily IO bound at this point.
edit: not really.

Contributor

jreback commented Mar 16, 2013

could try doing a tolist() on the numeric arrays (in the helper, when creating the series)
so that we can eliminate the np.asscalar test and conversions
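jreback's suggestion in miniature, using `ndarray.item()` (the modern spelling of what `np.asscalar` did):

```python
import numpy as np

col = np.array([1.5, 2.0, 3.0])

# what the formatter effectively did before: one conversion call per value
per_element = [v.item() for v in col]

# the suggested change: convert the whole column up front in one C-level pass
bulk = col.tolist()

assert per_element == bulk
assert all(type(v) is float for v in bulk)  # native floats, cheap to str()
```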

Contributor

y-p commented Mar 16, 2013

If it works, post the commit hash and i'll update.

Contributor

jreback commented Mar 16, 2013

try this out

jreback/pandas@d78f4f6

Contributor

y-p commented Mar 16, 2013

240ms. very nice.
But this doubles the memory footprint doesn't it? definitely add as option.

Contributor

jreback commented Mar 16, 2013

your wish is my command :) had the same issue in HDFStore, so chunked it

jreback/pandas@55adfb7
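The chunking idea can be sketched like this (a minimal illustration, not the PR's actual code; `to_csv_chunked` is a made-up name — pandas' `to_csv` grew a `chunksize` argument for exactly this):

```python
import numpy as np
import pandas as pd

def to_csv_chunked(df, path, chunksize=100000):
    """Write df in row chunks so only one chunk of stringified data
    is materialized at a time, bounding peak memory."""
    with open(path, "w") as fh:
        fh.write(",".join(map(str, df.columns)) + "\n")
        for start in range(0, len(df), chunksize):
            chunk = df.iloc[start:start + chunksize]
            # only this slice is converted to strings at once
            for row in chunk.itertuples(index=False):
                fh.write(",".join(str(v) for v in row) + "\n")

df = pd.DataFrame({"A": range(10), "B": np.arange(10) + 1.0})
to_csv_chunked(df, "/tmp/junk_chunked.csv", chunksize=4)
```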

Contributor

y-p commented Mar 16, 2013

very good, and that also bounds the memory used by the list() copy in the call to cython.

I hereby declare these games over. 10x will have to do...

👍

Contributor

jreback commented Mar 16, 2013

yep

go team!

Contributor

y-p commented Mar 16, 2013

refactored things a bit. @jreback the last iteration broke a vbench, take a look?

I'll be adding some tests, and if all goes well I'm +1 for this going into 0.11.

Contributor

jreback commented Mar 16, 2013

I will take a look

Contributor

jreback commented Mar 16, 2013

what vbench fails for you? all of io_bench.py seems to run for me

Contributor

jreback commented Mar 16, 2013

added this commit; just revised the vbenches a bit (added one that is identical to the OP),
as well as the mixed test case

51793db

You must have done something, speed is even better than last night!

This is amazing (not a wide frame but 25M+ rows p/m)

In [20]: aa=pd.DataFrame({'A':range(10000000)})

In [21]: aa['B'] = aa.A + 1.0

In [22]: aa['C'] = aa.A + 2.0

In [23]: aa['D'] = aa.A + 3.0

In [24]: %timeit aa.to_csv('test.csv')
1 loops, best of 3: 22 s per loop

Against: 3e55bd7

Results:
                                                     t_head t_baseline      ratio
name                                                                    
frame_to_csv2  (100,000)                           182.7691  2237.0021     0.0817
write_csv_standard  (10,000)                        35.4316   238.0822     0.1488
frame_to_csv_mixed (10,000 + wide)                 379.7200  1126.1289     0.3372
frame_to_csv  (3,000)                              150.2640   220.4599     0.6816

Contributor

y-p commented Mar 16, 2013

I did the native type treatment for the index and columns as well, and
yanked a repeated cast out of the chunking loop. But that was mostly for "correctness";
I couldn't measure any effect on the base case. Maybe with more
complex indexes it does more.

Contributor

y-p commented Mar 16, 2013

the vbench failure was from travis: https://travis-ci.org/y-p/pandas/jobs/5549764.

I fixed another vb problem in c682aa6. Random data generated in the setup
sometimes hits corner cases where the operation is unsupported, e.g. a
non-unique index.

Maybe that's the reason for the failure, but it would be a strange coincidence
if the "csv_mixed" vbench happened to fault by chance, unrelated to our new
commits.

Contributor

jreback commented Mar 16, 2013

I replaced that vbench with unique names in the columns

new one should be ok

that vbench had duplicate column names across dtypes

I'll have to look at a test for that

Contributor

y-p commented Mar 16, 2013

Do you want me to add the commits to this PR? better just push to master, I think.

Contributor

jreback commented Mar 16, 2013

I think u can push everything u have
if u can add the vbench commit great (if not I can push later)

I will add a test later

Contributor

y-p commented Mar 16, 2013

I'd like to add some tests specific to the new code before merging. The new code has more
corner cases than the old.

Wes pushed 0.11 back, so we've got time to get this in. But the vb patches should be
pushed regardless.

Contributor

jreback commented Mar 16, 2013

the commit I indicated is pushed to my local
but I think u have to add your local PR?
as I can't push to your local, right?

Contributor

y-p commented Mar 16, 2013

I could add you as a collab to my repo but I think syncing before merges
was working well for us.

the VB fix, in 51793db, should be pushed to pydata/pandas master.
is there another commit you'd like me to add to this PR branch?

Contributor

jreback commented Mar 16, 2013

go ahead and push to master
I'll push the vbench ( and a test for the error)

Contributor

jreback commented Mar 17, 2013

this commit: f5fe433581d0a68dbc53f2c7b5c87c7370afcbf3

adds a test for the failing issue. unfortunately there's not much we can do about it. I had constructed duplicate column names across dtypes, which raises an exception, but it's a very tiny edge case.

changed the dict of series to a list (and changed the cython) too

20% more improvement (203ms)...as we don't have the double lookups and the column numbers are straight list indexing.....

if you can incorporate my last 2 commits and push, that would be great

thxs

Contributor

y-p commented Mar 17, 2013

done.

Contributor

y-p commented Mar 17, 2013

223ms

@jreback jreback TST: test for to_csv on failing vbench
duplicate column names across dtypes is a problem, and not easy to fix,
so letting the test fail.
bb7d1da
Contributor

jreback commented Mar 17, 2013

Revised Perf against: 3e55bd7

Results:
                                            t_head t_baseline      ratio
name                                                                    
frame_to_csv2                             173.7061  2267.7870     0.0766
write_csv_standard                         34.7550   240.4461     0.1445
frame_to_csv_mixed                        360.0419  1123.1232     0.3206
frame_to_csv                              108.3450   229.2049     0.4727
Contributor

y-p commented Mar 17, 2013

Hmmm. Why is frame_to_csv seeing only 2x?

frame_to_csv has 10% less data (90k vs 100k) than frame_to_csv2, but is seeing
one-sixth the improvement.

maybe some perf boost left on the ground for wide/short frames?

correction: that's (3000, 30) vs (100k, 4), so about 75% less data.

Contributor

jreback commented Mar 17, 2013

there is prob a bit of fixed overhead
so < 10000 maybe hitting that

Contributor

y-p commented Mar 17, 2013

still, it's enough data to make the overhead negligible I would expect.

There's something else going on here, bad scaling behavior in frame width:

In [20]: N = 100000
    ...: for w in [1,10,100,500,1000,2500,5000,10000,20000,100000]:
    ...:     df= pd.DataFrame(np.random.randn(N/w, w))
    ...:     print df.shape
    ...:     %timeit -n2 -r5 df.to_csv('/tmp/junk1',index=False)
(100000, 1)
2 loops, best of 5: 185 ms per loop
(10000, 10)
2 loops, best of 5: 150 ms per loop
(1000, 100)
2 loops, best of 5: 154 ms per loop
(200, 500)
2 loops, best of 5: 197 ms per loop
(100, 1000)
2 loops, best of 5: 320 ms per loop
(40, 2500)
2 loops, best of 5: 412 ms per loop
(20, 5000)
2 loops, best of 5: 591 ms per loop
(10, 10000)
2 loops, best of 5: 1.02 s per loop

constant prod(shape) should take constant runtime (roughly), if we're IO bound (we should be).
Edit: still not really.

Contributor

jreback commented Mar 17, 2013

maybe set the chunk size as a function of the cols
to get a constant amount of data written
say 100k rows for 10 cols

so at 10000 cols only write 10000 rows at a time

too much and somewhere doing lots of mem allocation
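jreback's rule of thumb (100k rows for 10 cols, i.e. a roughly constant number of cells per chunk) could look like this; `adaptive_chunksize` and the cell budget are our invention, not the PR's code:

```python
def adaptive_chunksize(ncols, cells_per_chunk=1000000):
    """Hypothetical helper: pick a row chunk size so each chunk carries a
    roughly constant number of cells, however wide the frame is."""
    return max(1, cells_per_chunk // max(1, ncols))

print(adaptive_chunksize(10))     # 100000 rows at a time for a 10-col frame
print(adaptive_chunksize(10000))  # only 100 rows at a time for a very wide one
```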

Contributor

y-p commented Mar 17, 2013

Changing the row count constant 10x to 1000 produced no change in the curve,
so maybe something else.

Maybe the preallocated rows list outgrows cpu cache?
hard to believe. Will look more closely some other time.

Contributor

y-p commented Mar 17, 2013

still doesn't scale properly. but good idea anyway

Contributor

y-p commented Mar 17, 2013

In [2]: N = 100000
   ...: for w in [1,10,100,500,1000,2500,5000,10000,20000,100000]:
   ...:     df= pd.DataFrame(np.random.randn(N/w, w))
   ...:     print df.shape
   ...:     %timeit -n2 -r5 df.to_csv('/tmp/junk1',index=False)
(100000, 1)
2 loops, best of 5: 198 ms per loop
(10000, 10)
2 loops, best of 5: 196 ms per loop
(1000, 100)
2 loops, best of 5: 188 ms per loop
(200, 500)
2 loops, best of 5: 159 ms per loop
(100, 1000)
2 loops, best of 5: 156 ms per loop
(40, 2500)
2 loops, best of 5: 163 ms per loop
(20, 5000)
2 loops, best of 5: 175 ms per loop
(10, 10000)
2 loops, best of 5: 215 ms per loop
(5, 20000)
2 loops, best of 5: 194 ms per loop
(1, 100000)
2 loops, best of 5: 350 ms per loop

y-p and others added some commits Mar 17, 2013

@y-p y-p PERF: avoid iteritems->iloc panelty for data conversion, use blocks 20d3247
@jreback jreback TST: test for to_csv on failing vbench
     duplicate column names across dtypes is a problem, and
     not-easy to fix, so letting test fail
67ca8ae
Contributor

jreback commented Mar 17, 2013

refactor to move to_native_types into internals and index
csvformatting to format.py
(I started with your last commit)
perf is the same

2a3211498c89930df3ebb99de145506e54e2a60d

Contributor

y-p commented Mar 17, 2013

done. looks good.

Contributor

jreback commented Mar 17, 2013

in theory should add a test for this:

format a multi index that has a datetime embedded
and one with NaT

as I sort of cheated on the mi and just called tolist
it prob works, just didn't test it

Contributor

y-p commented Mar 17, 2013

it prob works, just didn't test it.

Famous last words if ever I heard them :).
I'll be beefing up the tests anyway, I'll do it.

Contributor

jreback commented Mar 17, 2013

true but wasn't tested before!

Contributor

y-p commented Mar 17, 2013

Travis is completely red on the new commit.

Contributor

jreback commented Mar 17, 2013

jreback/pandas@e699fb6
updated, that's what I get for rushing (didn't even run all tests)

Contributor

jreback commented Mar 17, 2013

you might want to add a check and fail early if there are duplicate columns and it's not a mi;
right now it fails with an odd exception ("None has no getitem")

Contributor

y-p commented Mar 17, 2013

almost iambic pentameter, but not quite.

sorry, I lost you. what fails if there are dupe columns?

Contributor

y-p commented Mar 17, 2013

oh, I did use a dict for the colname -> index mapping. that'll cause
some trouble. thanks, will fix.

Contributor

jreback commented Mar 17, 2013

I put a test for this (which does raise)

something like mixed duplicates

it fails correctly, but I think we should have it fail much earlier with a better exception
(the exception happens when you can't index into d in save_chunk)

Contributor

jreback commented Mar 17, 2013

http://stackoverflow.com/questions/15462344/writing-the-result-of-multiple-calls-on-a-manipulated-dataframe-to-one-csv-fil

worth adding an option for this? maybe append? (default to False)
can do easily with mode='a', header=False
on subsequent calls
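The mode='a', header=False idiom in action (these parameters exist on to_csv; the file path is just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

df1.to_csv("/tmp/combined.csv", index=False)                          # writes the header once
df2.to_csv("/tmp/combined.csv", index=False, mode="a", header=False)  # appends rows only

print(open("/tmp/combined.csv").read())
```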

Contributor

y-p commented Mar 17, 2013

In a separate PR. We have ExcelWriter for that; it makes sense for CSVWriter to allow this as well.

Contributor

jreback commented Mar 17, 2013

Contributor

jreback commented Mar 18, 2013

any issues remaining in this?

Contributor

jreback commented Mar 18, 2013

fyi....just to compare: writing to HDFStore as a table, with the same frame as tested here (100k rows), now takes about the same, 200ms

Contributor

y-p commented Mar 18, 2013

That's encouraging, means we're probably done.

I want to add tests for dataframes dimensioned in and around the "magic numbers"
in the code (N=1000 etc.), and maybe add a legacy keyword or similar to to_csv(),
so that if data corruption crept in and this is in 0.11, we can give people a way to
revert to the old code without working off master.

I'll definitely get this into 0.11, but I'm putting it off for a few days,
while working on something else.

Contributor

jreback commented Mar 18, 2013

np...take time

I was trying to see if writing to a table was a problem, but that writing is already block-separated and in cython. (The slow part is actually a non-trivial issue because it needs a rec-array under the hood, and even though I am using cython to do this, I think the actual usage is quite slow.) But it's pretty good for now.

Contributor

y-p commented Mar 18, 2013

and let's get a nod from @wesm before merging.

Contributor

jreback commented Mar 18, 2013

@wesm ok to merge for 0.11...pretty nice improvement, no api change (except the addition of the chunksize param)

Contributor

y-p commented Mar 18, 2013

test_to_csv_mixed_dups_cols fails for the new code but doesn't for the old code;
I didn't have time to look into it earlier. What's up with that?

Contributor

jreback commented Mar 18, 2013

in 0.10.1
you can create this, but you can't do anything with it

df = pd.concat([ pd.DataFrame(np.random.randn(1000, 30),dtype='float64'),
                       pd.DataFrame(np.random.randn(1000, 30),dtype='int64')],axis=1)

e.g. just try printing it

df.to_csv(filename)

happens to work, but I am not even sure it's right (I think there is some overwriting, but it's very hard to tell)

Contributor

y-p commented Mar 18, 2013

was this created by the new dtypes work you did for 0.11, or was it always there?

Contributor

y-p commented Mar 18, 2013

sorry, was present in 0.10.1 you said.

Contributor

jreback commented Mar 18, 2013

actually concat allows the creation of a frame with the same column names across dtypes, but 0.10.1 doesn't handle it, so the to_csv is sort of a red herring.....

I guess I should fix concat so that this is not legal

Contributor

jreback commented Mar 18, 2013

ok...finally tracked this down..... to_csv 'works' with duplicated columns (e.g. a non-unique column index), but IS wrong in 0.10.1; with multiple dtypes it 'works' as well, but again IS wrong

it is overwriting the data in series[k] (where k is the column name), so I'm going to put in an explicit exception for this

Contributor

jreback commented Mar 18, 2013

2400877d9764fab7323ef5b783c2d858521f8f64

Contributor

y-p commented Mar 18, 2013

You have some conflict crud in there.

Contributor

jreback commented Mar 18, 2013

I know...I forgot to remove some stuff when I rebased, and I don't know how to get rid of it (as it's deletions.....)

Contributor

y-p commented Mar 18, 2013

that's ok, i'll deal with it.

Owner

wesm commented Mar 19, 2013

Merge away. Nice teamwork guys

@y-p y-p merged commit 4d9a3d3 into pandas-dev:master Mar 19, 2013

Contributor

y-p commented Mar 19, 2013

One more small perf tweak. Heavy testing found only one bug, fixed.
Travis all green, and there's a hidden fallback to the old csv_helper code.

Ship it.

Contributor

jreback commented Mar 19, 2013

AWESOME!

y-p deleted the unknown repository branch Mar 19, 2013

Contributor

jreback commented Mar 19, 2013

typos in RELEASE.rst: the 3059 link is pointing to 3039, and df.to_csv is misspelled:

• Improved performance of dv.to_csv() by up to 10x in some cases. (GH3059)
Contributor

y-p commented Mar 19, 2013

fixed.

Contributor

jreback commented Mar 19, 2013

fyi vs 0.10.1

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_get_dtype_counts                      0.0988    217.2718     0.0005
frame_wide_repr                             0.5526    216.5370     0.0026
groupby_first_float32                       3.0029    341.3520     0.0088
groupby_last_float32                        3.1525    339.0419     0.0093
frame_to_csv2                             190.5260   2244.4260     0.0849
indexing_dataframe_boolean                 13.6755    126.9212     0.1077
write_csv_standard                         38.1940    234.2570     0.1630
frame_reindex_axis0                         0.3215      1.1042     0.2911
frame_to_csv_mixed                        369.0670   1123.0412     0.3286
frame_to_csv                              112.2720    226.7549     0.4951
frame_mult                                 22.6785     42.7152     0.5309
frame_add                                  24.3593     41.8012     0.5827
frame_reindex_upcast                       11.8235     17.0124     0.6950
frame_fancy_lookup_all                     15.0496     19.4497     0.7738
Contributor

y-p commented Mar 19, 2013

We should jigger the code so the first column of the test names spells "AwesomeRelease".

If I had to pick one thing, I'd say iloc/loc is the best addition in 0.11, though.
ix and I have been playing "20 questions" for far too long.

Contributor

jreback commented Mar 19, 2013

hahah....fyi I put up a commit on the centered mov stuff when you have a chance....all fixed except for rolling_count, which I am not quite sure what the results should be...

Contributor

y-p commented Mar 19, 2013

btw, the final perf tweak was all about avoiding iteritems and its reliance
on icol. there might be a fast path hiding there which would get you another
0.0 entry on test_perf.sh.

Contributor

jreback commented Mar 19, 2013

I saw your data pre-allocation, did you change iteritems somewhere?

fyi....as a side issue, that's the problem with the duplicate col names: essentially the col label maps to the column in the block, but there isn't a simple map; in theory it should be a positional map, but that's very tricky

Contributor

y-p commented Mar 19, 2013

@jreback , can you follow up on the SO question, and make a pandas user happy?

Contributor

y-p commented Mar 19, 2013

Your refactor calculated series using iteritems, and that was the cause of the O(ncols)
behaviour I noted. Eliminating that in favor of yanking the data out directly from the blocks
turned it into ~O(1), but it breaks encapsulation. It would be nice to use a proper interface
rather than rifle through the bowels of the underlying block manager.

Come to think of it, iteritems is col-oriented; I wonder if all that could have
been avoided... with iterrows. oh boy.
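A rough sketch of the principle (our illustration, not the PR's block-level code, which reaches directly into the BlockManager):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(5), "B": np.arange(5) + 1.0})

# pull each column's ndarray out once, up front, as native Python values...
cols = [df[c].values.tolist() for c in df.columns]

# ...so the row loop is plain list indexing, with no per-row Series creation
rows = [[col[i] for col in cols] for i in range(len(df))]

print(rows[0])  # [0, 1.0]
```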

Contributor

jreback commented Mar 19, 2013

iterrows does a lot of stuff...and the block-based approach is better anyhow. I will answer the SO question

Contributor

y-p commented Mar 19, 2013

The tests I added for duplicate columns were buggy, and didn't catch the fact
that dupe columns are disabled for to_csv.

v0.10.1's to_csv() mangles dupe names into "dupe.1, dupe.2," etc. That's an ok workaround,
but what's the fundamental reason we can't just do it straight? is there one?

correction: it's read_csv that does the mangling.
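For reference, the read_csv mangling mentioned in the correction looks like this:

```python
import io

import pandas as pd

csv = "dupe,dupe\n1,2\n"
df = pd.read_csv(io.StringIO(csv))

# read_csv deduplicates the repeated header by suffixing the later one
print(list(df.columns))  # ['dupe', 'dupe.1']
```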

Contributor

jreback commented Mar 19, 2013

the issue is that the blocks hold a 2-d array of the items in a particular block; the mapping between where an item is in the block and in the frame is dependent on a unique column name (in the case of a mi this is fine, of course).

There isn't a positional map (from columns in the frame to columns in the block), and keeping one is pretty tricky.

Wes has a fairly complicated routine to find the correct column even with duplicates, and it succeeds unless they are split across blocks (which is the case that I am testing here).

I suppose you could temporarily do a rename on the frame (no idea if that works), then iterate on the blocks, which would solve the problem. As I said, 0.10.1 actually prints the data out twice. I think raising is ok here; it's very unlikely to happen, and if it does you can just use a mi in the first place.

Contributor

y-p commented Mar 19, 2013

I don't follow. If you can display a repr for a frame with dupe columns, you can write it out to csv.

Contributor

jreback commented Mar 19, 2013

can't display a repr either......I will create an issue for this..

Contributor

jreback commented Mar 19, 2013

Contributor

y-p commented Mar 19, 2013

We're probably thinking of different things. what do you mean you can't repr a frame
with dupe columns?

In [19]: pd.DataFrame([[1,1]],columns=['a','a'])
Out[19]: 
   a  a
0  1  1

on the other hand

In [4]: df.to_csv("/tmp/a")
    807         if len(set(self.cols)) != len(self.cols):
--> 808             raise Exception("duplicate columns are not permitted in to_csv")
    809 
    810         self.colname_map = dict((k,i) for i,k in  enumerate(obj.columns))

Exception: duplicate columns are not permitted in to_csv

That's completely messed up.

Contributor

y-p commented Mar 19, 2013

I think I can rejig CSVWriter to do this properly, or at least discover what it is that I'm
failing to understand.

Contributor

jreback commented Mar 19, 2013

a bit contrived, but this is the case that's the issue (yours we should allow)
the problem is that detecting this case is hard

In [32]: df1 = pd.DataFrame([[1]],columns=['a'],dtype='float64')

In [33]: df2 = pd.DataFrame([[1]],columns=['a'],dtype='int64')

In [34]: df3 = pd.concat([df1,df2],axis=1)

In [35]: df3.columns
Out[35]: Index([a, a], dtype=object)

In [36]: df3.index
Out[36]: Int64Index([0], dtype=int64)

In [37]: df3
Out[37]: ----------------------------------------------
Exception: ('Cannot have duplicate column names split across dtypes', u'occurred at index a')
Contributor

y-p commented Mar 19, 2013

So:

  • the test that raises an exception should be finer-grained, and fail only if dupes
    are split across blocks.
  • I should fix up the way the data object is constructed to handle dupe columns.

Then we can think about the general case.

y-p restored the unknown repository branch Mar 19, 2013

y-p deleted the unknown repository branch Mar 19, 2013

Contributor

y-p commented Mar 19, 2013

continued in #3095.

Thank you #3059, it's been fun.

Contributor

y-p commented Mar 19, 2013

dupe columns common case fixed in master via 1f138a4.

Contributor

y-p commented Mar 26, 2013

Changed the 'legacy' keyword to engine='python', to be consistent with c_parser,
in case it sticks around.

Guys, this is truly amazing work. With this read/write CSV performance, pandas truly has enterprise-leading I/O performance. Recently I was reading some zipped CSV files into a DataFrame at ~40MB/sec, thinking to myself how much faster this is than many distributed 'Hadoop' solutions I've seen... :)
