
BENCH: Add benchmarks for np.loadtxt reading from CSV format files #11422

Merged 1 commit into numpy:master on Aug 20, 2018

Conversation

tylerjereddy
Contributor

NumPy currently has no asv benchmarks for np.loadtxt. This PR provides a set of asv benchmarks that can serve as a starting point for assessing future performance improvements to np.loadtxt handling of CSV files; parsing of enormous CSV files may be assessed separately from the conventional asv suite.

Many of the benchmarks here are inspired by those used for pandas.read_csv, but some modifications were needed given the different capabilities, APIs, and so on.
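As a sketch of the general shape these benchmarks take (the class body and sample data below are illustrative stand-ins, not the exact code added in this PR), an asv benchmark for comment handling might look like:

```python
# Hypothetical sketch of an asv-style benchmark class for np.loadtxt on CSV
# data with inline comments; names and the sample data are illustrative
# stand-ins, not the exact code added in this PR.
import numpy as np
from io import StringIO


class LoadtxtCSVComments:
    # asv times the benchmark once per entry in params
    params = [10, 100, 10000]
    param_names = ['num_lines']

    def setup(self, num_lines):
        # build an in-memory CSV containing inline comments
        rows = [u'1,2,3 # an inline comment'] * num_lines
        self.data_comments = StringIO(u'\n'.join(rows))

    def time_comment_loadtxt_csv(self, num_lines):
        np.loadtxt(self.data_comments, delimiter=u',', comments=u'#')
        # rewind so repeated timing iterations see a fresh stream
        self.data_comments.seek(0)
```

The seek(0) at the end of the timed function is the rewind discussed at length further down in this thread.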

Sample new benchmark results from running on HEAD locally with asv run -e -b "Loadtxt*":

(some output snipped for clarity)

· Running 7 total benchmarks (1 commits * 1 environments * 7 benchmarks)           
[  0.00%] · For numpy commit hash 464f79eb:                                        
[  0.00%] ·· Building for virtualenv-py3.6-six                                     
[  0.00%] ·· Benchmarking virtualenv-py3.6-six                                     
[ 14.29%] ··· Running ....LoadtxtCSVComments.time_comment_loadtxt_csv         ok  
[ 14.29%] ····                                                                     
               =========== ============                                            
                num_lines                                                          
               ----------- ------------                                            
                    10       31.9±1μs                                              
                   100      32.2±0.8μs                                             
                  10000      45.9±1μs                                              
                  100000     597±2ms                                               
               =========== ============      

[ 28.57%] ··· Running bench_io.LoadtxtCSVSkipRows.time_skiprows_csv           ok  
[ 28.57%] ····                                                                     
               ========== =======                                                  
                skiprows                                                           
               ---------- -------                                                  
                   0       741ms                                                   
                  500      718ms                                                   
                 10000     681ms                                                   
               ========== =======     
[ 42.86%] ··· Running ...tCSVStructured.time_loadtxt_csv_struct_dtype    439±5ms
[ 57.14%] ··· Running ..._io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv         ok  
[ 57.14%] ····                                                                     
               ============ =========== ============                               
                  dtype      num_lines                                             
               ------------ ----------- ------------                               
                 float32         10       29.9±1μs                                 
                 float32        100      31.5±0.4μs                                
                 float32       10000     49.8±0.8μs                                
                 float32       100000    581±0.5ms                                 
                 float64         10      30.6±0.3μs                                
                 float64        100      30.1±0.2μs                                
                 float64       10000     45.3±0.1μs                                
                 float64       100000     609±2ms                                  
                  int32          10      29.3±0.2μs                                
                  int32         100      29.8±0.6μs                                
                  int32        10000     45.8±0.4μs                                
                  int32        100000    575±0.9ms                                 
                  int64          10      30.2±0.5μs                                
                  int64         100       31.3±1μs                                 
                  int64        10000      51.5±1μs                                 
                  int64        100000      746ms                                   
                complex128       10      32.0±0.2μs                                
                complex128      100      32.1±0.5μs                                
                complex128     10000      62.9±3μs                                 
                complex128     100000      754ms                                   
                   str           10       33.2±1μs                                 
                   str          100      32.4±0.6μs                                
                   str         10000     47.3±0.6μs                                
                   str         100000     574±2ms                                  
                  object         10      33.2±0.4μs                                
                  object        100      32.4±0.3μs                                
                  object       10000     47.0±0.5μs                                
                  object       100000     561±1ms                                  
               ============ =========== ============ 

[ 71.43%] ··· Running ...o.LoadtxtReadUint64Integers.time_read_uint64         ok
[ 71.43%] ····                                                                  
               ======= ============                                             
                 size                                                           
               ------- ------------                                             
                 550     29.7±1μs                                               
                 1000   29.7±0.3μs                                              
                10000   36.2±0.6μs                                              
               ======= ============                                             
                                                                                
[ 71.43%] ·····         

[ 85.71%] ··· Running ...adUint64Integers.time_read_uint64_neg_values         ok
[ 85.71%] ····                                                                  
               ======= ============                                             
                 size                                                           
               ------- ------------                                             
                 550    30.6±0.4μs                                              
                 1000   28.8±0.5μs                                              
                10000   35.5±0.2μs                                              
               ======= ============         
[100.00%] ··· Running ...o.LoadtxtUseColsCSV.time_loadtxt_usecols_csv         ok
[100.00%] ····                                                                  
               ============== ============                                      
                  usecols                                                       
               -------------- ------------                                      
                     2         35.9±0.4μs                                       
                   [1, 3]      36.8±0.2μs                                       
                [1, 3, 5, 7]    40.9±1μs                                        
               ============== ============     

@tylerjereddy
Contributor Author

@mattip has noticed that the new benchmarks don't work with Python 2.7 yet and suggested that I also add a datetime data parsing example; I'm working on both.

@tylerjereddy
Contributor Author

We could also think about doing a quick "dry run" of the asv suite in CircleCI so that PRs that break the suite are caught, but that's probably out of scope for this specific PR.

@tylerjereddy
Contributor Author

The new benchmarks should now be Python 2.7 and 3.6 compatible. A new datetime asv benchmark has also been added for CSV reading by loadtxt. Using the command:

asv run -e -b "LoadtxtCSVDateTime"

I get the following (simplified) results for 2.7 and 3.6:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-six
·· Installing into conda-py2.7-six..
· Running 2 total benchmarks (1 commits * 2 environments * 1 benchmarks)
[  0.00%] · For numpy commit hash 464f79eb:
[  0.00%] ·· Building for conda-py2.7-six..
[  0.00%] ·· Benchmarking conda-py2.7-six
[ 50.00%] ··· Running bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime    ok
[ 50.00%] ···· 
               =========== ============
                num_lines              
               ----------- ------------
                    20      71.1±0.8μs 
                   200      81.3±0.5μs 
                   2000      248±1ms   
                  20000       2.51s    
               =========== ============

[ 50.00%] ····· 
                            
[ 50.00%] ·· Building for conda-py3.6-six...
[ 50.00%] ·· Benchmarking conda-py3.6-six
[100.00%] ··· Running bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime    ok
[100.00%] ···· 
               =========== ============
                num_lines              
               ----------- ------------
                    20      38.8±0.6μs 
                   200      43.9±0.2μs 
                   2000      215±2ms   
                  20000       2.12s    
               =========== ============

[100.00%] ····· 

@tylerjereddy
Contributor Author

Although these are perhaps looking a bit more sensible now, @mattip has noticed that these new benchmarks produce UserWarning: loadtxt: Empty input file: "<_io.StringIO object at 0x106ac8850>".

I had noticed that too, but I typically ignore warnings from asv runs as long as the benchmarks run; I'll see if I can avoid it though.

@tylerjereddy
Contributor Author

Initial debug observations here as far as the UserWarning from loadtxt during asv benchmarks goes:

  • the UserWarning goes away if the setup() method is removed and all of its content is placed in the actual benchmark timing test (doing this is bad, but the point was to debug)
    • similarly, if I retain the setup() method and its code but hold off on generating the StringIO object until the benchmark test proper (still bad, because it confounds timing), I can suppress the warning
  • I can't reproduce the UserWarning using similar code outside of the asv framework; this suggests that a StringIO object created in the asv setup() method may behave differently when passed through to the test proper
  • it seems possible to suppress the UserWarning by using larger StringIO objects (more CSV lines) in setup()

@tylerjereddy
Contributor Author

In short, calling np.loadtxt() on the same StringIO object twice in a row generates a UserWarning because the object first needs to be rewound before it can be read again.

The issue I'm having is that placing seek(0) within setup, teardown, or even both doesn't strictly enforce the rewind -- only placing seek(0) within the benchmark function proper does that, but that's obviously not an acceptable solution.
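The exhausted-stream behavior itself can be reproduced in a few lines outside of asv (a minimal sketch, not the benchmark code): the second read off the same StringIO yields no data unless seek(0) intervenes.

```python
# Reading the same StringIO twice: the second read sees an exhausted
# stream and yields no data unless seek(0) rewinds it first.
import warnings
import numpy as np
from io import StringIO

buf = StringIO(u'1,2\n3,4\n')

first = np.loadtxt(buf, delimiter=u',')       # consumes the stream
assert first.shape == (2, 2)

with warnings.catch_warnings():
    warnings.simplefilter('ignore')           # loadtxt warns on empty input
    second = np.loadtxt(buf, delimiter=u',')  # stream not rewound: no data
assert second.size == 0

buf.seek(0)                                   # rewind before re-reading
third = np.loadtxt(buf, delimiter=u',')
assert third.shape == (2, 2)
```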

@tylerjereddy
Contributor Author

@pv any ideas? I tried to debug this with @stefanv and I think we were both confused because the same StringIO object was being reused across setup() calls more than expected (i.e., we don't get 10 unique StringIO objects even though setup() seems to get called 10 times per benchmark, once per repeat).

@tylerjereddy
Contributor Author

In the absence of an immediate solution with asv, here is a benchmark summary plot from a simple script comparing average ± std. dev. timings for pandas.read_csv() and np.loadtxt() (on a log scale) using the master branches of each project:

[benchmark summary plot: pandas.read_csv() vs np.loadtxt() timings, log scale]

Perhaps a bit crude to use an ad hoc timing script, but perhaps also sufficient as a starting point for evaluating relative improvements?

@tylerjereddy
Contributor Author

Ok, I've revised this PR to resolve the issue with the UserWarning in benchmarks. In short, asv benchmarks DO call setup() between all repeats, but NOT between iterations within repeats (the same holds true within timeit itself, which is what asv uses under the hood).

The same would also be true for a benchmark that, for example, sorts a list -- each time_sort call won't get a unique copy of the unsorted list produced in setup(), so you have to copy the unsorted list within the benchmark test itself -- see the discussion by Raymond Hettinger.
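The sorting analogy can be demonstrated with timeit directly; the setup statement runs once per timing run, not once per iteration, so a sort benchmark must copy the unsorted data inside the timed statement (a small illustrative sketch):

```python
# timeit's setup runs once per timing run, not once per iteration, so a
# sort benchmark must copy the unsorted list inside the statement itself.
import timeit

# Sorting a copy each iteration: every call sees unsorted input.
t_copy = timeit.timeit('sorted(data)',
                       setup='data = list(range(1000, 0, -1))',
                       number=100)

# Sorting in place without a copy: only the first iteration sorts
# unsorted data; the remaining 99 sort an already-sorted list.
t_inplace = timeit.timeit('data.sort()',
                          setup='data = list(range(1000, 0, -1))',
                          number=100)
```

The stale-StringIO problem in this PR is the same phenomenon: state mutated during one iteration leaks into the next.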

This means that you can't successfully use np.loadtxt on the same StringIO object twice without rewinding it first. This doesn't hold true for an actual file written to disk; you'll notice that one of the benchmarks I wrote here does read from a real file and doesn't require a rewind.

pandas also uses StringIO in their CSV reading benchmarks, we use StringIO in our np.loadtxt docstring, and there are some advantages to avoiding writing large files to disk for the suite. But if the reviewer preference here is to use actual file objects and avoid the rewinding noise, I'd be ok with that too.

Perhaps I'll open an issue with asv to suggest an improvement there, but realistically this mostly boils down to how timeit works at a fundamental level. Perhaps an asv construct that allows subtracting the time required for certain repeat tasks (like seeking back to 0) would be cool.

@tylerjereddy
Contributor Author

The alternative to seek(0) would be to force a single iteration with number=1. This also prevents stale reads off the end of StringIO objects, though one has to realize that the warmup runs asv performs will still generate a UserWarning, because the warmups won't necessarily call setup() regularly. I see that with number=1 the StringIO objects that are "recreated" sometimes share the same memory address, but that's likely just CPython reusing a freed object's address rather than Python or asv doing anything special.

I think I prefer the explicit seeking though -- let asv and timeit do their things as far as deciding the appropriate repeat and iteration counts, and just accept that there's some seek() time mixed in to each result.

@@ -81,6 +85,11 @@ def time_comment_loadtxt_csv(self, num_lines):

# inspired by similar benchmark in pandas
# for read_csv

# need to rewind StringIO object (unfortunately
# confounding timing result somewhat) for every
Member

Why not do the rewinding at the end of the load, not the start? That way the first timing should cost the same as subsequent ones.

Contributor Author

That's an interesting idea -- the only reservation I'd have is that it may actually be a good thing if the first trial is faster (or as fast as possible), since the timeit docs suggest:

It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.

I suspect that's also why asv does "warm-up" runs to get things cached and so on for fastest possible timings, but I'm happy to change the order if suitable.
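To make the quoted advice concrete, here is a small illustrative snippet contrasting the min() of timeit.repeat results with the mean ± std. dev. that the perf library prefers:

```python
# min() of timeit.repeat gives a lower bound on how fast the machine can
# run the snippet; mean/stdev instead capture run-to-run variability.
import statistics
import timeit

results = timeit.repeat('sorted(range(1000))', repeat=5, number=100)

lower_bound = min(results)          # the number timeit's docs recommend
avg = statistics.mean(results)      # perf-style average
spread = statistics.stdev(results)  # perf-style variability estimate
```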

Member

There are other opinions; see the discussion in the perf library, which prefers the average +/- std as a more realistic estimation of how sensitive your code is to machine variations. I think the warm-up is for JITs like PyPy or Numba that need time to get going, although there is also discussion on the speed.python.org mailing list about whether it is valid to discard the warm-up time, since it too is part of the time to run a program.

@pv
Member

pv commented Jul 12, 2018 via email

@charris
Member

charris commented Jul 23, 2018

Any resolution as to what to do here?

@tylerjereddy
Contributor Author

I'm happy to just revise to satisfy the suggestion from @eric-wieser.

Perhaps also worth noting that the pandas team gave different feedback on a similar PR: pandas-dev/pandas#21807

They were mostly concerned about people forgetting to rewind the StringIO object in the future, so they wanted me to write some kind of class property -- not sure whether the NumPy team would like that here as well?

@tylerjereddy
Contributor Author

Revised based on comment from @eric-wieser to rewind after loadtxt runs each time, then squashed / rebased / force pushed.

@tylerjereddy
Contributor Author

tylerjereddy commented Aug 1, 2018

The equivalent pandas PR was merged--any additional requests here?

@mattip
Member

mattip commented Aug 8, 2018

It would be nice to sync our solutions for the same problems across the different projects as much as possible. Two reasons off the top of my head:

  • reduce confusion for developers who commit to both projects
  • reduce possible sources of contention when comparing benchmark results across projects

So I would prefer to accept the pandas PR solution here. @eric-wieser, thoughts?

@eric-wieser
Member

eric-wieser commented Aug 8, 2018

@mattip: I don't think we should follow pandas just because they wrote the benchmark first. In my opinion, it's sloppy to do cleanup at the start of a function - it belongs at the end. If we want things in sync, we should fixup the pandas benchmark.

index += 1

# expand data to specified number of lines
data = date_line * int(num_lines / 20.)
Member

eric-wieser commented Aug 8, 2018

Better as num_lines // 20?

# must rewind StringIO because of state
# dependence of file reading
np.loadtxt(self.csv_data,
delimiter=',',
Member

Should these be u',' to match the unicode csv_data?

@tylerjereddy
Contributor Author

tylerjereddy commented Aug 8, 2018

Revised to add the requested unicode delimiters and floor division in place of the int(a / b.) style.

Still no errors or warnings when running the revised benchmarks in 2.7 and 3.6 locally.

index = 0

for entry in dates:
date_line += (str(entry) + ',' + str(values[index]) + '\n')
Member

There's some silent str -> unicode conversion happening here on python2, but I suppose all the strings are ascii anyway

date_line = u''
index = 0

for entry in dates:
Member

eric-wieser commented Aug 9, 2018

Better as for date, value in zip(dates, values) rather than the unpythonic index variable
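Applied to the loop above, the zip() suggestion might read as follows (the dates and values arrays here are stand-ins; the real benchmark builds its own data in setup()):

```python
# zip()-based rewrite of the index-variable loop, using stand-in data.
import numpy as np

dates = np.arange(u'2018-01', u'2018-05', dtype='datetime64[M]')
values = np.array([1.5, 2.5, 3.5, 4.5])

date_line = u''
for date, value in zip(dates, values):
    date_line += str(date) + u',' + str(value) + u'\n'
```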

* NumPy currently has no asv benchmarks for np.loadtxt; now adding
a number of benchmarks for np.loadtxt handling of CSV format files
to prevent performance regressions
@tylerjereddy
Contributor Author

Ok, revised again based on the feedback from Eric; still no warnings or errors when running the CSV benchmarks locally in 2.7 and 3.6.

Side notes out of scope for current PR:

  • We currently use CI to check that asv benchmarks don't break, but only for Python 2.7 because that's what is in our asv.conf.json; it's probably worth switching that to 3.6 or 3.7 when we're close to dropping 2.7, or perhaps running 2.7 and 3.6 for a while to be thorough (though that would double the run time of the CI job in the matrix)
  • Our quality control for asv not "breaking" is to check whether the string Traceback is present in the log file -- if we want to be really strict and catch warnings as well (warnings caused "silent" issues here), we could expand the grep to check for warning-related strings; I see that there are some warnings we are ignoring at the moment
  • The changes here seem to increase the time for the asv run CI job by about 90 seconds, give or take the uncertainty of the Travis hardware; long-running benchmarks can easily be missed if the checkmark is green, but maybe that's reasonable

Let me know if any of the adjustments above would be helpful / justified; the original motivation here was to provide a basis for improving the performance of loadtxt with CSV files, but the community has largely provided feedback guiding time investment elsewhere, so this is mostly just about preventing performance regressions now.

@charris
Member

charris commented Aug 18, 2018

@eric-wieser Good to go? I wish the approval tags had a date...

@eric-wieser
Member

eric-wieser commented Aug 20, 2018

Can't comment on whether 90 more seconds of CI is a bad thing, but patch looks fine to me.

@charris
Member

charris commented Aug 20, 2018

I think we can live with the 90 sec, at least for a while. I'm thinking we should spend some of the NumPy funding to improve our testing times.

@charris charris merged commit 52c67eb into numpy:master Aug 20, 2018
@charris
Member

charris commented Aug 20, 2018

Thanks Tyler.
