
BENCH: Add benchmarks for np.loadtxt reading from CSV format files #11422

Merged 1 commit into numpy:master on Aug 20, 2018

Conversation

tylerjereddy
Contributor

NumPy currently has no asv benchmarks for np.loadtxt. This PR provides a set of asv benchmarks that can serve as a starting point for assessing future performance improvements to np.loadtxt handling of CSV files; parsing of enormous CSV files may be assessed separately from the conventional asv suite.

Many of the benchmarks here are inspired by those used for pandas.read_csv, but some modifications were needed given the different capabilities, APIs, and so on.
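As a sketch of the general shape these benchmarks take (the class body and sample data below are illustrative stand-ins, not the exact code added in this PR), an asv benchmark for comment handling might look like:

```python
# Hypothetical sketch of an asv-style benchmark class for np.loadtxt on CSV
# data with inline comments; names and the sample data are illustrative
# stand-ins, not the exact code added in this PR.
import numpy as np
from io import StringIO


class LoadtxtCSVComments:
    # asv times the benchmark once per entry in params
    params = [10, 100, 10000]
    param_names = ['num_lines']

    def setup(self, num_lines):
        # build an in-memory CSV containing inline comments
        rows = [u'1,2,3 # an inline comment'] * num_lines
        self.data_comments = StringIO(u'\n'.join(rows))

    def time_comment_loadtxt_csv(self, num_lines):
        np.loadtxt(self.data_comments, delimiter=u',', comments=u'#')
        # rewind so repeated timing iterations see a fresh stream
        self.data_comments.seek(0)
```

The seek(0) at the end of the timed function is the rewind discussed at length further down in this thread.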

Sample new benchmark results from running on HEAD locally with asv run -e -b "Loadtxt*":

(some output snipped for clarity)

· Running 7 total benchmarks (1 commits * 1 environments * 7 benchmarks)           
[  0.00%] · For numpy commit hash 464f79eb:                                        
[  0.00%] ·· Building for virtualenv-py3.6-six                                     
[  0.00%] ·· Benchmarking virtualenv-py3.6-six                                     
[ 14.29%] ··· Running ....LoadtxtCSVComments.time_comment_loadtxt_csv         ok  
[ 14.29%] ····                                                                     
               =========== ============                                            
                num_lines                                                          
               ----------- ------------                                            
                    10       31.9±1μs                                              
                   100      32.2±0.8μs                                             
                  10000      45.9±1μs                                              
                  100000     597±2ms                                               
               =========== ============      

[ 28.57%] ··· Running bench_io.LoadtxtCSVSkipRows.time_skiprows_csv           ok  
[ 28.57%] ····                                                                     
               ========== =======                                                  
                skiprows                                                           
               ---------- -------                                                  
                   0       741ms                                                   
                  500      718ms                                                   
                 10000     681ms                                                   
               ========== =======     
[ 42.86%] ··· Running ...tCSVStructured.time_loadtxt_csv_struct_dtype    439±5ms
[ 57.14%] ··· Running ..._io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv         ok  
[ 57.14%] ····                                                                     
               ============ =========== ============                               
                  dtype      num_lines                                             
               ------------ ----------- ------------                               
                 float32         10       29.9±1μs                                 
                 float32        100      31.5±0.4μs                                
                 float32       10000     49.8±0.8μs                                
                 float32       100000    581±0.5ms                                 
                 float64         10      30.6±0.3μs                                
                 float64        100      30.1±0.2μs                                
                 float64       10000     45.3±0.1μs                                
                 float64       100000     609±2ms                                  
                  int32          10      29.3±0.2μs                                
                  int32         100      29.8±0.6μs                                
                  int32        10000     45.8±0.4μs                                
                  int32        100000    575±0.9ms                                 
                  int64          10      30.2±0.5μs                                
                  int64         100       31.3±1μs                                 
                  int64        10000      51.5±1μs                                 
                  int64        100000      746ms                                   
                complex128       10      32.0±0.2μs                                
                complex128      100      32.1±0.5μs                                
                complex128     10000      62.9±3μs                                 
                complex128     100000      754ms                                   
                   str           10       33.2±1μs                                 
                   str          100      32.4±0.6μs                                
                   str         10000     47.3±0.6μs                                
                   str         100000     574±2ms                                  
                  object         10      33.2±0.4μs                                
                  object        100      32.4±0.3μs                                
                  object       10000     47.0±0.5μs                                
                  object       100000     561±1ms                                  
               ============ =========== ============ 

[ 71.43%] ··· Running ...o.LoadtxtReadUint64Integers.time_read_uint64         ok
[ 71.43%] ····                                                                  
               ======= ============                                             
                 size                                                           
               ------- ------------                                             
                 550     29.7±1μs                                               
                 1000   29.7±0.3μs                                              
                10000   36.2±0.6μs                                              
               ======= ============                                             
                                                                                
[ 71.43%] ·····         

[ 85.71%] ··· Running ...adUint64Integers.time_read_uint64_neg_values         ok
[ 85.71%] ····                                                                  
               ======= ============                                             
                 size                                                           
               ------- ------------                                             
                 550    30.6±0.4μs                                              
                 1000   28.8±0.5μs                                              
                10000   35.5±0.2μs                                              
               ======= ============         
[100.00%] ··· Running ...o.LoadtxtUseColsCSV.time_loadtxt_usecols_csv         ok
[100.00%] ····                                                                  
               ============== ============                                      
                  usecols                                                       
               -------------- ------------                                      
                     2         35.9±0.4μs                                       
                   [1, 3]      36.8±0.2μs                                       
                [1, 3, 5, 7]    40.9±1μs                                        
               ============== ============     

@tylerjereddy
Contributor Author

@mattip has noticed that the new benchmarks don't work with Python 2.7 yet and suggested that I also add a datetime data parsing example; I'm working on both.

@tylerjereddy
Contributor Author

We could also think about doing a quick "dry run" of the asv suite in CircleCI so that PRs that break the suite are caught, but that's probably out of scope for this specific PR.

@tylerjereddy
Contributor Author

The new benchmarks should now be Python 2.7 and 3.6 compatible. A new datetime asv benchmark has also been added for CSV reading by loadtxt. Using the command:

asv run -e -b "LoadtxtCSVDateTime"

I get the following (simplified) results for 2.7 and 3.6:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-six
·· Installing into conda-py2.7-six..
· Running 2 total benchmarks (1 commits * 2 environments * 1 benchmarks)
[  0.00%] · For numpy commit hash 464f79eb:
[  0.00%] ·· Building for conda-py2.7-six..
[  0.00%] ·· Benchmarking conda-py2.7-six
[ 50.00%] ··· Running bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime    ok
[ 50.00%] ···· 
               =========== ============
                num_lines              
               ----------- ------------
                    20      71.1±0.8μs 
                   200      81.3±0.5μs 
                   2000      248±1ms   
                  20000       2.51s    
               =========== ============

[ 50.00%] ····· 
                            
[ 50.00%] ·· Building for conda-py3.6-six...
[ 50.00%] ·· Benchmarking conda-py3.6-six
[100.00%] ··· Running bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime    ok
[100.00%] ···· 
               =========== ============
                num_lines              
               ----------- ------------
                    20      38.8±0.6μs 
                   200      43.9±0.2μs 
                   2000      215±2ms   
                  20000       2.12s    
               =========== ============

[100.00%] ····· 

@tylerjereddy
Contributor Author

Although these are perhaps looking a bit more sensible now, @mattip has noticed that these new benchmarks produce UserWarning: loadtxt: Empty input file: "<_io.StringIO object at 0x106ac8850>".

I had noticed that too, but I typically ignore warnings from asv runs as long as the benchmarks run; I'll see if I can avoid it though.

@tylerjereddy
Contributor Author

Initial debug observations here as far as the UserWarning from loadtxt during asv benchmarks goes:

  • the UserWarning goes away if the setup() method is removed and all of its content is placed in the actual benchmark timing test (doing this is bad, but the point was to debug)
    • similarly, if I retain the setup() method and its code but hold off on generating the StringIO object until the benchmark test proper (still bad, because it confounds timing), I can suppress the warning
  • I can't reproduce the UserWarning using similar code outside of the asv framework; this suggests that a StringIO object created in the asv setup() method may behave differently when passed through to the test proper
  • it seems possible to suppress the UserWarning by using larger StringIO objects (more CSV lines) in setup()

@tylerjereddy
Contributor Author

In short, calling np.loadtxt() on the same StringIO object twice in a row generates a UserWarning because the object first needs to be rewound before it can be read again.

The issue I'm having is that placing seek(0) within setup, teardown, or even both doesn't strictly enforce the rewind -- only placing seek(0) within the benchmark function proper does that, but that's obviously not an acceptable solution.
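The exhausted-stream behavior itself can be reproduced in a few lines outside of asv (a minimal sketch, not the benchmark code): the second read off the same StringIO yields no data unless seek(0) intervenes.

```python
# Reading the same StringIO twice: the second read sees an exhausted
# stream and yields no data unless seek(0) rewinds it first.
import warnings
import numpy as np
from io import StringIO

buf = StringIO(u'1,2\n3,4\n')

first = np.loadtxt(buf, delimiter=u',')       # consumes the stream
assert first.shape == (2, 2)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')           # loadtxt warns on empty input
    second = np.loadtxt(buf, delimiter=u',')  # stream not rewound: no data
assert second.size == 0

buf.seek(0)                                   # rewind before re-reading
third = np.loadtxt(buf, delimiter=u',')
assert third.shape == (2, 2)
```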

@tylerjereddy
Contributor Author

@pv any ideas? I tried to debug this with @stefanv and I think we were both confused because the same StringIO object was being reused across setup() calls more than expected (i.e., we don't get 10 unique StringIO objects even though setup() seems to get called 10 times per benchmark, once per repeat).

@tylerjereddy
Contributor Author

In the absence of an immediate solution with asv, here is a benchmark summary plot from a simple script comparing average ± std. dev. timings for pandas.read_csv() and np.loadtxt() (on a log scale) using the master branches of each project:

[benchmark summary plot: pandas.read_csv() vs np.loadtxt() timings, log scale]

Perhaps a bit crude to use an ad hoc timing script, but perhaps also sufficient as a starting point for evaluating relative improvements?

@tylerjereddy
Contributor Author

Ok, I've revised this PR to resolve the issue with the UserWarning in benchmarks. In short, asv benchmarks DO call setup() between all repeats, but NOT between iterations within repeats (the same holds true within timeit itself, which is what asv uses under the hood).

The same would also be true for a benchmark that, for example, sorts a list -- each time_sort call won't get a unique copy of the unsorted list produced in setup(), so you have to copy the unsorted list within the benchmark test itself -- see the discussion by Raymond Hettinger.
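The sorting analogy can be demonstrated with timeit directly; the setup statement runs once per timing run, not once per iteration, so a sort benchmark must copy the unsorted data inside the timed statement (a small illustrative sketch):

```python
# timeit's setup runs once per timing run, not once per iteration, so a
# sort benchmark must copy the unsorted list inside the statement itself.
import timeit

# Sorting a copy each iteration: every call sees unsorted input.
t_copy = timeit.timeit('sorted(data)',
                       setup='data = list(range(1000, 0, -1))',
                       number=100)

# Sorting in place without a copy: only the first iteration sorts
# unsorted data; the remaining 99 sort an already-sorted list.
t_inplace = timeit.timeit('data.sort()',
                          setup='data = list(range(1000, 0, -1))',
                          number=100)
```

The stale-StringIO problem in this PR is the same phenomenon: state mutated during one iteration leaks into the next.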

This means that you can't successfully use np.loadtxt on the same StringIO object twice without rewinding it first. This doesn't hold true for an actual file written to disk; you'll notice that one of the benchmarks I wrote here does read from a real file and doesn't require a rewind.

pandas also uses StringIO in their CSV reading benchmarks, we use StringIO in our np.loadtxt docstring, and there are some advantages to avoiding writing large files to disk for the suite. But if the reviewer preference here is to use actual file objects and avoid the rewinding noise, I'd be ok with that too.

Perhaps I'll open an issue with asv to suggest an improvement there, but realistically this mostly boils down to how timeit works at a fundamental level. Perhaps an asv construct that allows subtracting the time required for certain repeat tasks (like seeking back to 0) would be cool.

@tylerjereddy
Contributor Author

The alternative to seek(0) would be to force a single iteration with number=1. This also prevents stale reads off the end of StringIO objects, though one has to realize that the warmup runs asv performs will still generate a UserWarning, because the warmups won't necessarily call setup() regularly. I see that with number=1 the StringIO objects that are "recreated" sometimes share the same memory address, but that's likely just CPython reusing a freed object's address rather than Python or asv doing anything special.

I think I prefer the explicit seeking though -- let asv and timeit do their things as far as deciding the appropriate repeat and iteration counts, and just accept that there's some seek() time mixed in to each result.

@@ -81,6 +85,11 @@ def time_comment_loadtxt_csv(self, num_lines):

# inspired by similar benchmark in pandas
# for read_csv

# need to rewind StringIO object (unfortunately
# confounding timing result somewhat) for every
Member

Why not do the rewinding at the end of the load, not the start? That way the first timing should cost the same as subsequent ones.

Contributor Author

That's an interesting idea -- the only reservation I'd have is that it may actually be a good thing if the first trial is faster (or as fast as possible), since the timeit docs suggest:

It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.

I suspect that's also why asv does "warm-up" runs to get things cached and so on for fastest possible timings, but I'm happy to change the order if suitable.
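To make the quoted advice concrete, here is a small illustrative snippet contrasting the min() of timeit.repeat results with the mean ± std. dev. that the perf library prefers:

```python
# min() of timeit.repeat gives a lower bound on how fast the machine can
# run the snippet; mean/stdev instead capture run-to-run variability.
import statistics
import timeit

results = timeit.repeat('sorted(range(1000))', repeat=5, number=100)

lower_bound = min(results)          # the number timeit's docs recommend
avg = statistics.mean(results)      # perf-style average
spread = statistics.stdev(results)  # perf-style variability estimate
```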

Member

There are other opinions; see the discussion in the perf library, which prefers the average +/- std as a more realistic estimation of how sensitive your code is to machine variations. I think the warm-up is for JITs like PyPy or Numba that need time to get going, although there is also discussion on the speed.python.org mailing list about whether it is valid to discard the warm-up time, since it too is part of the time to run a program.

@pv
Member

pv commented Jul 12, 2018 via email

@charris
Member

charris commented Jul 23, 2018

Any resolution as to what to do here?

@tylerjereddy
Contributor Author

I'm happy to just revise to satisfy the suggestion from @eric-wieser.

Perhaps also worth noting that the pandas team gave different feedback on a similar PR: pandas-dev/pandas#21807

They were mostly concerned about people forgetting to rewind the StringIO object in the future, so they wanted me to write some kind of class property -- not sure whether the NumPy team would like that here as well?

@tylerjereddy
Contributor Author

Revised based on comment from @eric-wieser to rewind after loadtxt runs each time, then squashed / rebased / force pushed.

@tylerjereddy
Contributor Author

tylerjereddy commented Aug 1, 2018

The equivalent pandas PR was merged--any additional requests here?

@mattip
Member

mattip commented Aug 8, 2018

It would be nice to sync our solutions for the same problems across the different projects as much as possible. Two reasons off the top of my head:

  • reduce confusion for developers who commit to both projects
  • reduce possible sources of contention when comparing benchmark results across projects

So I would prefer to accept the pandas PR solution here. @eric-wieser, thoughts?

@eric-wieser
Member

eric-wieser commented Aug 8, 2018

@mattip: I don't think we should follow pandas just because they wrote the benchmark first. In my opinion, it's sloppy to do cleanup at the start of a function - it belongs at the end. If we want things in sync, we should fixup the pandas benchmark.

index += 1

# expand data to specified number of lines
data = date_line * int(num_lines / 20.)
Member

eric-wieser commented Aug 8, 2018

Better as num_lines // 20?

# must rewind StringIO because of state
# dependence of file reading
np.loadtxt(self.csv_data,
delimiter=',',
Member

Should these be u',' to match the unicode csv_data?

@tylerjereddy
Contributor Author

tylerjereddy commented Aug 8, 2018

Revised to add the requested unicode delimiters and floor division in place of the int(a / b.) style.

Still no errors or warnings when running the revised benchmarks in 2.7 and 3.6 locally.

index = 0

for entry in dates:
date_line += (str(entry) + ',' + str(values[index]) + '\n')
Member

There's some silent str -> unicode conversion happening here on python2, but I suppose all the strings are ascii anyway

date_line = u''
index = 0

for entry in dates:
Member

eric-wieser commented Aug 9, 2018

Better as for date, value in zip(dates, values) rather than the unpythonic index variable
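Applied to the loop above, the zip() suggestion might read as follows (the dates and values arrays here are stand-ins; the real benchmark builds its own data in setup()):

```python
# zip()-based rewrite of the index-variable loop, using stand-in data.
import numpy as np

dates = np.arange(u'2018-01', u'2018-05', dtype='datetime64[M]')
values = np.array([1.5, 2.5, 3.5, 4.5])

date_line = u''
for date, value in zip(dates, values):
    date_line += str(date) + u',' + str(value) + u'\n'
```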

* NumPy currently has no asv benchmarks for np.loadtxt; now adding
a number of benchmarks for np.loadtxt handling of CSV format files
to prevent performance regressions
@tylerjereddy
Contributor Author

Ok, revised again based on the feedback from Eric; still no warnings or errors when running the CSV benchmarks locally in 2.7 and 3.6.

Side notes out of scope for current PR:

  • We currently use CI to check that asv benchmarks don't break, but only for Python 2.7 because that's what is in our asv.conf.json; it's probably worth switching that to 3.6 or 3.7 when we're close to dropping 2.7, or perhaps running 2.7 and 3.6 for a while to be thorough (though that would double the run time of the CI job in the matrix)
  • Our quality control for asv not "breaking" is to check whether the string Traceback is present in the log file -- if we want to be really strict and catch warnings as well (warnings caused "silent" issues here), we could expand the grep to check for warning-related strings; I see that there are some warnings we are ignoring at the moment
  • The changes here seem to increase the time for the asv run CI job by about 90 seconds, give or take the uncertainty of the Travis hardware; long-running benchmarks can easily be missed if the checkmark is green, but maybe that's reasonable

Let me know if any of the adjustments above would be helpful / justified; the original motivation here was to provide a basis for improving the performance of loadtxt with CSV files, but the community has largely provided feedback guiding time investment elsewhere, so this is mostly just about preventing performance regressions now.

@charris
Member

charris commented Aug 18, 2018

@eric-wieser Good to go? I wish the approval tags had a date...

@eric-wieser
Member

eric-wieser commented Aug 20, 2018

Can't comment on whether 90 more seconds of CI is a bad thing, but patch looks fine to me.

@charris
Member

charris commented Aug 20, 2018

I think we can live with the 90 sec, at least for a while. I'm thinking we should spend some of the NumPy funding to improve our testing times.

@charris charris merged commit 52c67eb into numpy:master Aug 20, 2018
@charris
Member

charris commented Aug 20, 2018

Thanks Tyler.
