BENCH: Add benchmarks for np.loadtxt reading from CSV format files #11422
Conversation
@mattip has noticed that the new benchmarks don't work with Python 2.7 yet and suggested that I also add a datetime data parsing example; I'm working on both.
Could also think about doing a quick "dry-run" of the asv suite in CircleCI so that PRs breaking the suite are caught, but that's probably out of scope for this specific PR.
The new benchmarks should now be Python 2.7 & 3.6 compatible. The new datetime asv benchmark has also been added for CSV reading by loadtxt. Running the suite, I get the following (simplified) results for 2.7 and 3.6:
Although these are perhaps looking a bit more sensible now, @mattip has noticed that these new benchmarks do produce warnings. I had noticed that, but typically ignore warnings from asv runs as long as the benchmarks run; perhaps I'll see if I can avoid this though.
Initial debug observations here, as far as the warnings are concerned:
In short, calling np.loadtxt advances the position of the StringIO object. The issue I'm having is that placing the seek(0) within setup, teardown, or even both, doesn't strictly enforce the rewind -- only placing seek(0) within the benchmark function proper does that, but that's not an acceptable solution obviously.
In the absence of an immediate solution within the setup / teardown hooks, it's perhaps a bit crude to use an explicit seek(0) inside the benchmark itself.
Ok, I've revised this PR to resolve the issue with the StringIO rewinding. The same would also be true for a benchmark that, i.e., looks at sorting a list -- each timed iteration operates on the same object and inherits whatever state the previous iteration left behind. This means that you can't successfully use setup / teardown alone to reset per-iteration state.
Perhaps I'll open an issue with asv about this.
The alternative to seeking would be to construct a fresh StringIO object for each read. I think I prefer the explicit seeking though -- it keeps the state dependence visible in the benchmark.
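To make the state dependence concrete, here is a minimal sketch (illustrative, not code from the PR): np.loadtxt consumes the StringIO stream, so a second read needs an explicit rewind.

```python
from io import StringIO

import numpy as np

csv_data = StringIO(u"1.0,2.0\n3.0,4.0\n")

arr = np.loadtxt(csv_data, delimiter=u',')   # consumes the stream
assert arr.shape == (2, 2)
assert csv_data.read() == u''                # position is now at EOF

csv_data.seek(0)                             # explicit rewind
arr2 = np.loadtxt(csv_data, delimiter=u',')  # second read now succeeds
assert np.array_equal(arr, arr2)
```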
benchmarks/benchmarks/bench_io.py
Outdated
@@ -81,6 +85,11 @@ def time_comment_loadtxt_csv(self, num_lines):
# inspired by similar benchmark in pandas
# for read_csv
# need to rewind StringIO object (unfortunately
# confounding timing result somewhat) for every
Why not do the rewinding at the end of the load, not the start? - that way the first time should cost the same as subsequent times
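A sketch of that ordering (hypothetical class and method names, not the exact bench_io.py code): rewinding after the load means every timed call, including the first, starts from the same stream position.

```python
from io import StringIO

import numpy as np

class LoadtxtCSVDemo(object):
    """Hypothetical asv-style benchmark illustrating rewind-at-end."""

    def setup(self):
        self.csv_data = StringIO(u"0.5,1.5\n2.5,3.5\n")

    def time_loadtxt_csv(self):
        np.loadtxt(self.csv_data, delimiter=u',')
        # rewind at the end: the next timed call, and the first one
        # (which starts from a fresh StringIO), see identical state
        self.csv_data.seek(0)

bench = LoadtxtCSVDemo()
bench.setup()
bench.time_loadtxt_csv()
bench.time_loadtxt_csv()  # repeatable because of the trailing rewind
```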
That's an interesting idea--the only reservation I'd have on that is that it may actually be a good thing if the first trial is faster (or as fast as possible) since timeit docs suggest:
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.
I suspect that's also why asv does "warm-up" runs to get things cached and so on for fastest possible timings, but I'm happy to change the order if suitable.
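The timeit recommendation quoted above can be illustrated with timeit.repeat (an illustrative snippet, not part of the PR):

```python
import timeit

# repeat() returns one total time per outer run; per the timeit docs,
# min() is the most informative statistic, since larger values mostly
# reflect interference from other processes, not Python itself
times = timeit.repeat(stmt="sum(range(100))", repeat=5, number=1000)

best = min(times)
assert best > 0
assert best <= sum(times) / len(times)  # the minimum never exceeds the mean
```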
There are other opinions; see the discussion in the perf library, which prefers the average +/- std as a more realistic estimation of how sensitive your code is to machine variations. I think the warm up is for JITs like PyPy or Numba that need time to get going, although there is also discussion on the speed.python.org mailing list whether it is valid to discard the warm up time, since it too is part of the time to run a program. Short warmup is also beneficial for newly launched CPython processes, as you can see in:
```
import numpy as np
import time

ts = []
for j in range(100):
    start = time.time()
    np.ones([100]).sum()
    ts.append(time.time() - start)

import matplotlib.pyplot as plt
plt.plot(np.arange(len(ts)), ts, 'o-')
plt.savefig('out.png')
```
Any resolution as to what to do here?
I'm happy to just revise to satisfy the suggestion from @eric-wieser. Perhaps also worth noting that the pandas team gave different feedback on a similar PR: pandas-dev/pandas#21807. They were mostly concerned about people forgetting to rewind the StringIO object in the future, so they wanted me to write some kind of class property -- not sure if the NumPy team would like that here as well?
Force-pushed 7765d3c to da27485
Revised based on comment from @eric-wieser to rewind after the load.
The equivalent pandas PR was merged -- any additional requests here?
It would be nice to sync our solutions for the same problems across the different projects as much as possible. Two reasons off the top of my head:
So I would prefer to accept the pandas PR solution here. @eric-wieser, thoughts?
@mattip: I don't think we should follow pandas just because they wrote the benchmark first. In my opinion, it's sloppy to do cleanup at the start of a function - it belongs at the end. If we want things in sync, we should fix up the pandas benchmark.
benchmarks/benchmarks/bench_io.py
Outdated
index += 1
# expand data to specified number of lines
data = date_line * int(num_lines / 20.)
Better as num_lines // 20?
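For non-negative integers the two spellings agree, but floor division avoids the round-trip through a float; the truncation-vs-floor difference only shows up for negative operands. A quick check:

```python
num_lines = 1000

# int(num_lines / 20.) rounds through a float; num_lines // 20 stays integral
assert int(num_lines / 20.) == num_lines // 20 == 50

# the two differ for negative operands: int() truncates, // floors
assert int(-7 / 2.) == -3 and -7 // 2 == -4
```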
benchmarks/benchmarks/bench_io.py
Outdated
# must rewind StringIO because of state
# dependence of file reading
np.loadtxt(self.csv_data,
           delimiter=',',
Should these be u',' to match the unicode csv_data?
Force-pushed da27485 to 988dd8e
Revised to add the requested unicode delimiters and floor division in place of the int() conversion. Still no errors or warnings when running the revised benchmarks in 2.7 and 3.6 locally.
benchmarks/benchmarks/bench_io.py
Outdated
index = 0
for entry in dates:
    date_line += (str(entry) + ',' + str(values[index]) + '\n')
There's some silent str -> unicode conversion happening here on python2, but I suppose all the strings are ascii anyway.
benchmarks/benchmarks/bench_io.py
Outdated
date_line = u''
index = 0
for entry in dates:
Better as for date, value in zip(dates, values) rather than the unpythonic index variable.
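A sketch of the zip-based rewrite, with made-up date/value data standing in for the benchmark's arrays:

```python
import numpy as np

# hypothetical stand-ins for the benchmark's date and value columns
dates = np.arange('2018-01-01', '2018-01-05', dtype='datetime64[D]')
values = [1.5, 2.5, 3.5, 4.5]

# zip pairs the two sequences directly, with no manual index counter
date_line = u''
for date, value in zip(dates, values):
    date_line += (str(date) + u',' + str(value) + u'\n')

assert date_line.splitlines()[0] == u'2018-01-01,1.5'
assert len(date_line.splitlines()) == 4
```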
* NumPy currently has no asv benchmarks for np.loadtxt; now adding a number of benchmarks for np.loadtxt handling of CSV format files to prevent performance regressions
Force-pushed 988dd8e to f25b18f
Ok, revised again based on the feedback from Eric; still no warnings or errors when running the CSV benchmarks locally in 2.7 and 3.6. Side notes out of scope for the current PR:
Let me know if any of the adjustments above would be helpful / justified; the original motivation here was to provide a basis for improving the performance of np.loadtxt.
@eric-wieser Good to go? I wish the approval tags had a date...
Can't comment on whether 90 more seconds of CI is a bad thing, but the patch looks fine to me.
I think we can live with the 90 sec, at least for a while. I'm thinking we should spend some of the NumPy funding to improve our testing times.
Thanks Tyler.
NumPy currently has no asv benchmarks for np.loadtxt; this PR is an attempt to provide a set of asv benchmarks that may be used as a starting point for assessing future performance improvements to np.loadtxt handling of CSV files, though parsing of enormous CSV files may obviously be assessed separately from the conventional asv suite.

Many of the benchmarks here are inspired by those used for pandas.read_csv, but some modifications were obviously needed given the different capabilities, APIs, etc.

Sample new benchmark results from running on HEAD locally with asv run -e -b "Loadtxt*":

(some output snipped for clarity)