ENH: improve performance of df.to_csv GH3054 #3059
Conversation
|
Case 2
Current pandas
case 2
With change (starting with yours + my patch)
case 2
|
|
Thanks Jeff. The patch was garbled by GH. Can you pull this branch and add your patch? |
|
After jeff's commit:
|
|
L1389 sure looks like one loop too many. |
|
I played around with this some more, but all of the pre-allocation and array indexing seems to be working (even the map around asscalar makes a big difference), so unless we write this loop in cython, I'm not sure how much better we can do |
|
Another 25%:
|
|
I get about 20% more with cython.....(on my test case that has lots of dates)..... |
|
I think you win! I get 0.47s with the same test (and 2.3s with current pandas) :( |
|
wait, I'm following your lead and cythonizing the latest python version. let's see... |
|
Remember we're benchmarking on different machines, timings differ. I get 0.407 with your cython branch. |
|
310ms |
|
whoosh |
|
hang on, I'm working on a linux kernel patch... |
|
That's another 20% (relative):
any more tricks? probably heavily IO bound at this point. |
|
could try doing a tolist() on the numeric arrays (in helper, when creating the series) |
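A minimal sketch of that tolist() idea (illustrative only, not the PR's actual helper): converting a numeric ndarray to a list up front yields native Python floats, so the per-element stringify loop no longer boxes a numpy scalar on every access.

```python
import numpy as np

arr = np.random.randn(1000)

# tolist() converts the whole array to native Python floats in one
# C-level pass, so the formatting loop touches only builtin objects
# instead of boxing a numpy scalar per element.
native = arr.tolist()
assert type(native[0]) is float

# The stringify step is then a plain loop over builtins:
rendered = [str(x) for x in native[:3]]
print(len(rendered))  # 3
```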
|
If it works, post the commit hash and i'll update. |
|
try this out |
|
240ms. very nice. |
|
your wish is my command :) had the same issue in HDFStore, so chunked it |
|
very good, and that also bounds the memory used by the list() copy in the call to cython. I hereby declare these games over. 10x will have to do... |
|
yep go team! |
|
refactored things a bit. @jreback the last iteration broke a vbench, take a look? I'll be adding some tests, and if all goes well I'm +1 for this going into 0.11. |
|
I will take a look |
|
Which vbench fails for you? All of io_bench.py seems to run for me. |
|
added this commit, just revised the vbenches a bit (added one that is identical to the op). You must have done something, speed is even better than last night! This is amazing (not a wide frame, but 25M+ rows p/m)
Against: 3e55bd7
|
|
I did the native type treatment for the index and columns as well, and |
|
the vbench failure was from travis: https://travis-ci.org/y-p/pandas/jobs/5549764. I fixed another vb problem in c682aa6. The random data is generated in the setup; maybe that's the reason for the failure, but it would be a strange coincidence |
|
I replaced that vbench with a new one using unique column names, which should be ok; that vbench had duplicate column names across dtypes. I'll have to look at a test for that |
|
Do you want me to add the commits to this PR? better just push to master, I think. |
|
I think u can push everything u have. I will add a test later |
|
I'd like to add some tests specific to the new code, before merging. Wes pushed 0.11 back, so we've got time to get this in. But the vb patches should be |
|
the commit I indicated is pushed to my local |
|
I could add you as a collab on my repo, but I think the VB fix in 51793db should be pushed to pydata/pandas master before merging. |
|
go ahead and push to master |
|
this commit: f5fe433581d0a68dbc53f2c7b5c87c7370afcbf3 adds a test for the failing issue; unfortunately not much we can do about it. I had constructed duplicate column names across dtypes, which raises an exception, but it's a very tiny edge case. Changed the dict of series to a list (and changed the cython too): 20% more improvement (203ms), as we don't have the double lookups and the column numbers are straight list indexing. If you can incorporate my last 2 commits and push, that would be great, thxs |
|
done. |
|
223ms |
|
Revised Perf against: 3e55bd7
|
|
Hmmm, why is frame_to_csv seeing only 2x? frame_to_csv has 10% less data (90k vs 100k) than frame_to_csv2; maybe there's some perf boost left on the ground for wide/short frames? Correction: that's (3000,30) vs (100k,4), so about 75% less data. |
|
there is prob a bit of fixed overhead |
|
still, it's enough data to make the overhead negligible, I would expect. There's something else going on here: bad scaling behavior in frame width:
In [20]: N = 100000
...: for w in [1,10,100,500,1000,2500,5000,10000,20000,100000]:
...: df= pd.DataFrame(np.random.randn(N/w, w))
...: print df.shape
...: %timeit -n2 -r5 df.to_csv('/tmp/junk1',index=False)
(100000, 1)
2 loops, best of 5: 185 ms per loop
(10000, 10)
2 loops, best of 5: 150 ms per loop
(1000, 100)
2 loops, best of 5: 154 ms per loop
(200, 500)
2 loops, best of 5: 197 ms per loop
(100, 1000)
2 loops, best of 5: 320 ms per loop
(40, 2500)
2 loops, best of 5: 412 ms per loop
(20, 5000)
2 loops, best of 5: 591 ms per loop
(10, 10000)
2 loops, best of 5: 1.02 s per loop |
|
maybe set the chunk size as a function of the cols; at 10000 cols, writing 10000 rows at a time is too much, and somewhere it's doing lots of mem allocation |
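That chunking rule could be sketched like this (hypothetical helper name and target constant, not necessarily what pandas settled on): bound the number of cells per chunk so rows-per-chunk shrinks as the frame gets wider.

```python
def csv_chunksize(ncols, target_cells=100_000):
    # Keep each chunk to roughly a constant number of cells: narrow
    # frames write many rows at once, very wide frames only a few.
    return max(1, target_cells // max(1, ncols))

print(csv_chunksize(1))          # 100000 rows per chunk
print(csv_chunksize(10_000))     # 10 rows per chunk
print(csv_chunksize(1_000_000))  # clamps at 1 row per chunk
```

This also caps the memory used by any per-chunk list() copies, regardless of frame shape.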
|
Changing the row count constant 10x, to 1000, produced no change in the curve. Maybe the preallocated rows list outgrows the cpu cache? |
|
still doesn't scale properly. but good idea anyway |
In [2]: N = 100000
...: for w in [1,10,100,500,1000,2500,5000,10000,20000,100000]:
...: df= pd.DataFrame(np.random.randn(N/w, w))
...: print df.shape
...: %timeit -n2 -r5 df.to_csv('/tmp/junk1',index=False)
(100000, 1)
2 loops, best of 5: 198 ms per loop
(10000, 10)
2 loops, best of 5: 196 ms per loop
(1000, 100)
2 loops, best of 5: 188 ms per loop
(200, 500)
2 loops, best of 5: 159 ms per loop
(100, 1000)
2 loops, best of 5: 156 ms per loop
(40, 2500)
2 loops, best of 5: 163 ms per loop
(20, 5000)
2 loops, best of 5: 175 ms per loop
(10, 10000)
2 loops, best of 5: 215 ms per loop
(5, 20000)
2 loops, best of 5: 194 ms per loop
(1, 100000)
2 loops, best of 5: 350 ms per loop |
|
refactored to move to_native_types into internals and index: 2a3211498c89930df3ebb99de145506e54e2a60d |
|
done. looks good. |
|
in theory we should add a test for this: formatting a multi index that has a datetime embedded, as I sort of cheated on the mi and just called tolist |
|
Famous last words if ever I heard them :). |
|
true but wasn't tested before! |
|
Travis is completely red on the new commit. |
|
jreback/pandas@e699fb6 |
|
you might want to add a check and fail early if there are duplicate columns and it's not a mi |
|
almost iambic pentameter, but not quite. sorry, I lost you. what fails if there are dupe columns? |
|
oh, I did use a dict for the colname -> index mapping. that'll cause |
|
I put in a test for this (which does raise), something like mixed duplicates; it fails correctly, but I think we should have it fail much earlier with a better exception |
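A fail-early check could look something like this (hypothetical helper, not the pandas API; the exception type and message are made up):

```python
import pandas as pd

def check_csv_writable(columns):
    # Refuse duplicate column names up front, before any writing starts,
    # unless the columns form a MultiIndex (discussed above as the case
    # where dupes are fine).
    if not isinstance(columns, pd.MultiIndex) and not columns.is_unique:
        raise ValueError("to_csv does not support duplicate column names; "
                         "deduplicate or use a MultiIndex")

check_csv_writable(pd.Index(["a", "b"]))  # ok, returns None
try:
    check_csv_writable(pd.Index(["a", "a"]))
except ValueError as e:
    print("raised:", e)
```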
|
what about adding an option for this? maybe append? (default to False) |
|
In a separate PR. We have ExcelWriter for that; makes sense for CSVWriter to allow this as well. |
|
any issues remaining in this? |
|
fyi....just to compare vs writing to HDFStore as a table, same frame as testing here, 100k rows, takes about the same now, 200ms |
|
That's encouraging; it means we're probably done. I want to add tests for dataframes dimensioned in and around the "magic numbers". I'll definitely get this into 0.11, but I'm putting it off for a few days, |
|
np...take your time. I was trying to see if writing to a table was a problem, but the writing is already block separated and in cython (the slow part is actually a non-trivial issue because it needs a rec-array under the hood, and even though i am using cython to do this, i think the actual usage is quite slow)....but pretty good for now |
|
and let's get a nod from @wesm before merging. |
|
@wesm ok to merge for 0.11? pretty nice improvement, no api change (except the addition of the chunksize parm) |
|
|
|
in 0.10.1
e.g. just try printing it
happens to work, but I am not even sure if it's right (I think there is some overwriting, but very hard to tell) |
|
was this created by the new dtypes work you did for 0.11, or was it always there? |
|
sorry, was present in 0.10.1 you said. |
|
actually concat allows the creation of a frame with the same column names across dtypes, but 0.10.1 doesn't, so the to_csv is sort of a red herring..... I guess I should fix concat so that this is not legal |
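For reference, this is the kind of frame being discussed (a sketch; modern pandas behaves differently from 0.10.1 here): concat along axis=1 happily produces duplicate column names whose data lives in different dtype blocks.

```python
import pandas as pd

# Two single-column frames with the same name but different dtypes:
df = pd.concat(
    [pd.DataFrame({"A": [1.0, 2.0]}),   # float block
     pd.DataFrame({"A": ["x", "y"]})],  # object block
    axis=1,
)
print(df.columns.tolist())  # ['A', 'A']
print(df.dtypes.tolist())   # float64 and object, under the same label
```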
|
ok...finally tracked this down. to_csv 'works' with duplicated columns (e.g. a non-unique index), but IS wrong in 0.10.1; with multiple dtypes it 'works' as well, but again IS wrong: it is overwriting the data in series[k] (where k is the column name). So, going to put in an explicit exception for this |
|
2400877d9764fab7323ef5b783c2d858521f8f64 |
|
You have some conflict crud in there. |
|
i know...i forgot to remove some stuff when i rebased. I don't know how to get rid of it (as it's deletions.....)? |
|
that's ok, i'll deal with it. |
|
Merge away. Nice teamwork guys |
y-p merged commit 4d9a3d3 into pandas-dev:master on Mar 19, 2013
|
One more small perf tweak. Heavy testing found only one bug, fixed. |
|
AWESOME! |
y-p deleted the unknown repository branch on Mar 19, 2013
|
typos in RELEASE.rst: the 3059 link is pointing to 3039, and df.to_csv |
|
|
fixed. |
|
fyi vs 0.10.1
|
|
We should jigger the code so the first column of the test names spells "AwesomeRelease". If I had to pick one thing, I'd say |
|
hahah....fyi I put up a commit on the centered mov stuff, when you have a chance....all fixed except for rolling_count, for which I am not quite sure what the results should be... |
|
btw, the final perf tweak was all about avoiding iteritems and its reliance |
|
I saw your data pre-allocation; did you change iteritems somewhere? fyi, as a side issue, that's the problem with the duplicate col names: essentially the col label maps to the column in the block, but there isn't a simple map; in theory it should be a positional map, but that's very tricky |
|
@jreback , can you follow up on the SO question, and make a pandas user happy? |
|
Your refactor calculated... Come to think of it, iteritems is col oriented; I wonder if all that could have |
|
iterrows does a lot of stuff....and the block based approach is better anyhow...I will answer the SO question |
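A rough public-API illustration of why the block-based approach wins (the grouping below is a stand-in for the internal blocks, not pandas internals): iteritems boxes every column into a Series, whereas grouping same-dtype columns lets each group's values be extracted and converted together.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})

# Per-column iteration constructs a Series object for every column:
per_column = [s.tolist() for _, s in df.items()]

# Block-style iteration groups columns by dtype first, so each group
# can be pulled out and converted to native types in one shot:
by_dtype = {}
for col in df.columns:
    by_dtype.setdefault(str(df[col].dtype), []).append(col)
print(sorted(by_dtype))  # e.g. float64, int, and object groups
```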
|
The tests I added for duplicate columns were buggy, and didn't catch the fact that v10.1 to_csv() mangles dupe names into "dupe.1, dupe.2", etc. That's an ok workaround. Correction: it's read_csv that does the mangling. |
|
the issue is that the blocks hold a 2-d array of the items in a particular block, and the mapping between where a column is in the block and in the frame is dependent on the column names being unique (in the case of a mi this is fine of course). There isn't a positional map (from columns in the frame to columns in the block), and keeping one is pretty tricky. Wes has a fairly complicated routine to find the correct column even with duplicates, and it succeeds unless they are across blocks (which is the case that I am testing here). I suppose you could temporarily do a rename on the frame (no idea if that works), then iterate on the blocks, which would solve the problem. As I said, 0.10.1 actually prints the data out twice. I think raising is ok here; it's very unlikely to happen, and if it does you can just use a mi in the first place. |
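The temporary-rename workaround mentioned there, sketched (illustrative only, not what the PR actually does):

```python
import pandas as pd

# A frame with duplicate column names across dtype blocks:
df = pd.concat(
    [pd.DataFrame({"A": [1.0]}), pd.DataFrame({"A": ["x"]})], axis=1
)

# Temporarily give the columns unique names so the name -> block lookup
# is unambiguous, write, then restore the originals.
original = df.columns
df.columns = [f"c{i}" for i in range(len(original))]
csv_text = df.to_csv(index=False)
df.columns = original

print(csv_text.splitlines()[1])  # the data row: 1.0,x
```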
|
I don't follow. If you can display a repr for a frame with dupe columns, you can write it out to csv. |
|
can't display a repr either......I will create an issue for this.. |
|
We're probably thinking of different things. What do you mean you can't repr a frame?
on the other hand
That's completely messed up. |
|
I think I can rejig CSVWriter to do this properly, or at least discover what it is that I'm |
|
a bit contrived, but this is the case that's the issue (yours we should allow)
|
|
So:
Then we can think about the general case. |
y-p restored the unknown repository branch on Mar 19, 2013
y-p deleted the unknown repository branch on Mar 19, 2013
jreback referenced this pull request on Mar 19, 2013: ENH: create BlockManager positional indexer (for easier dupe cols support) #3092 (Closed)
|
dupe columns common case fixed in master via 1f138a4. |
y-p referenced this pull request on Mar 20, 2013: optimize to_csv if formatting isn't needed #3054 (Closed)
|
Changed 'legacy' keyword to engine=='python', to be consistent with c_parser. |
dragoljub commented Apr 6, 2013
|
Guys, this is truly amazing work. With this performance for reading/writing CSVs, pandas truly has enterprise-leading I/O performance. Recently I was reading some zipped CSV files into DFs at ~40MB/sec, thinking to myself how much faster this is than many distributed 'Hadoop' solutions I've seen... :) |
y-p commented Mar 15, 2013
Needs more testing before merging.
Following SO question mentioned in #3054:
wins:
- fewer stringify calls.
- the cols loop uses range rather than creating and walking a generator at each iteration of the inner loop.
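A toy rendition of that inner-loop change (function names are illustrative, not the PR's code): convert columns to native lists once, preallocate the row buffer, and index with plain range loops instead of rebuilding a generator walk per row.

```python
import numpy as np

def rows_naive(columns, nrows):
    # Per row: a fresh generator-style walk over all columns,
    # touching boxed numpy scalars each time.
    return [[str(col[i]) for col in columns] for i in range(nrows)]

def rows_fast(columns, nrows):
    data = [col.tolist() for col in columns]  # native types up front
    ncols = len(data)
    row = [None] * ncols                      # preallocated row buffer
    out = []
    for i in range(nrows):
        for j in range(ncols):                # plain range indexing
            row[j] = str(data[j][i])
        out.append(row[:])
    return out

cols = [np.arange(3, dtype=float), np.arange(3, dtype=float)]
assert rows_naive(cols, 3) == rows_fast(cols, 3)
print(rows_fast(cols, 3)[0])  # ['0.0', '0.0']
```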