do you have any performance test compare with python pandas? #31

Closed
backtradercn opened this issue Jul 27, 2019 · 5 comments

@backtradercn

Do you have any performance tests comparing this with Python pandas?

@hosseinmoein
Owner

I don't. But if you have the means to do a meaningful comparison of both performance and scalability, that would be great.
Also @yssource

@yssource
Contributor

pandas is developed in Python, which is absolutely slower than C/C++.
For performance reasons, I decided to replace the algorithm part of my Python code with C++ DataFrame.
I then use pybind11 to expose it back to Python.

@backtradercn
For a simple speed test, these may help:

  1. Python's timeit module
    import timeit
  2. C++ Boost timer
    #include <boost/timer.hpp>
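
As a minimal sketch of the `timeit` suggestion, the snippet below times a pure-Python mean over pseudo-random numbers. The function names `mean_of_randoms` and `best_time`, the data size, and the repeat count are illustrative assumptions, not part of the project's benchmark.

```python
import random
import timeit


def mean_of_randoms(n: int, seed: int = 0) -> float:
    """Generate n pseudo-random floats in [0, 1) and return their mean."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


def best_time(n: int, repeats: int = 3) -> float:
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timer = timeit.Timer(lambda: mean_of_randoms(n))
    # Each repeat calls the function once; the minimum is the least noisy figure.
    return min(timer.repeat(repeat=repeats, number=1))


if __name__ == "__main__":
    print(f"best of 3: {best_time(100_000):.4f} s")
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement.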

@hosseinmoein
Owner

@backtradercn, @yssource
I have added a performance section to the README file explaining how the new performance test is run.

@qingtiandalaoye

MacBook> time python pandas_performance.py
All memory allocations are done. Calculating means ...

real 17m18.916s
user 4m47.113s
sys 5m31.901s
MacBook>
MacBook>
MacBook> time ../bin/Linux.GCC64/dataframe_performance
All memory allocations are done. Calculating means ...

real 6m40.222s
user 2m54.362s
sys 2m14.951s

It seems the C++ version is only about 2 times faster than Python?
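
To put a number on that impression, the sketch below converts the `time` figures from the two runs above into seconds and divides them. The helper name `to_seconds` is an assumption for illustration. These are end-to-end timings, so the ratios cover the whole program rather than any single stage.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '17m18.916s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two runs above.
pandas_run = {"real": "17m18.916s", "user": "4m47.113s", "sys": "5m31.901s"}
dataframe_run = {"real": "6m40.222s", "user": "2m54.362s", "sys": "2m14.951s"}

# Ratio > 1 means the pandas script took longer than the C++ binary.
ratios = {
    key: to_seconds(pandas_run[key]) / to_seconds(dataframe_run[key])
    for key in pandas_run
}

if __name__ == "__main__":
    for key, ratio in ratios.items():
        print(f"{key}: {ratio:.2f}x")
```

On these numbers the wall-clock ("real") ratio comes out at roughly 2.6x.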

@hosseinmoein hosseinmoein reopened this Sep 24, 2019
@hosseinmoein
Owner

hosseinmoein commented Sep 25, 2019

@qingtiandalaoye,
I think you are misinterpreting the specs, probably because I wasn’t clear in my writeup. A few points:

  1. The Pandas performance script does not really run in Python. I believe almost everything there is done in NumPy, which is implemented in C. That means DataFrame is more than 2x faster than NumPy/C.
  2. As I mentioned in “The interesting part” section, DataFrame is more than 2x faster than Pandas/NumPy in generating the same random numbers and loading them into column vectors. But DataFrame was about 10x faster in calculating means.
  3. You only load data once but calculate statistics many times. So in general DataFrame is about 10x faster than the parts of Pandas that are in NumPy. Parts of Pandas that are purely in Python should be much, much slower.
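
The third point can be illustrated with a small sketch: if loading is sped up by about 2x and each statistics pass by about 10x (the figures quoted above), the blended speedup moves toward 10x as the number of passes grows. The function name `overall_speedup` and the baseline costs passed to it are hypothetical values chosen for illustration, not measurements from the benchmark.

```python
def overall_speedup(
    load_time: float,
    stat_time: float,
    passes: int,
    load_gain: float = 2.0,   # "more than 2x" for loading, per the comment
    stat_gain: float = 10.0,  # "about 10x" for calculating means, per the comment
) -> float:
    """Blended speedup when data is loaded once and statistics run `passes` times."""
    baseline = load_time + passes * stat_time
    improved = load_time / load_gain + passes * stat_time / stat_gain
    return baseline / improved


if __name__ == "__main__":
    # Hypothetical baseline costs: loading takes 10 units, one pass takes 1 unit.
    for passes in (1, 10, 100, 1000):
        print(passes, round(overall_speedup(10.0, 1.0, passes), 2))
```

With a single pass the load stage dominates and the blended figure stays near 2x, which is consistent with an end-to-end timing looking much closer to 2x than to 10x.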
