Performance Profiling #194
Comments
This is a good read https://notes-on-cython.readthedocs.io/en/latest/classes.html Essentially there isn't a huge difference between There's plenty of OO going on in the codebase so there's areas where this is occurring, and either the method can be converted immediately to a Then where possible I'll call the faster Of course I only justify the extra cruft with some profiling. |
So I figured out how to get Unix time directly off the system clock, that is now available by Next step is to figure out how to generate a Where I was up to with the call is commented out inside the |
I found a solution for wrapping C code with the Python API:
There's some good code here. It's a possible solution; however, one thing I've been trying to do is keep all C code generated, so it's as if Stefan Behnel himself wrote the C code, and so memory safety is realistically guaranteed. At present the platform compiles to over 2 million lines of C. There's a cleanup script in Thoughts on the above? Current performance for Current performance for So that's more than an order of magnitude different, granted one is returning a |
UUIDs Current performance for built-in Current performance for Would be good to get this down to ~0.1μs ish, like I think we can get I've previously attempted to access the systems C API for UUIDs, but couldn't get that going at the time. These two methods are in focus because there's so many calls to them around the codebase. I'll do a post another day putting this in context, and listing where the time is currently spent for submitting orders. |
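For reference, per-call figures like the ones discussed here can be reproduced with a small micro-benchmark. This is only a sketch using the standard library `timeit` module; the helper name `per_call_microseconds` is hypothetical and is not part of the platform's perf tests.

```python
import timeit
import uuid


def per_call_microseconds(func, number=100_000, repeat=5):
    """Return the best observed per-call time of func() in microseconds.

    Hypothetical helper for illustration, not the platform's perf harness.
    """
    # Take the minimum over several runs to reduce noise from other processes.
    best_total = min(timeit.repeat(func, number=number, repeat=repeat))
    return best_total / number * 1e6


if __name__ == "__main__":
    print(f"uuid.uuid4: {per_call_microseconds(uuid.uuid4):.3f}μs per call")
```

The same helper can be pointed at any zero-argument callable, so the built-in and a candidate replacement can be compared under identical conditions.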
Just as a point of interest, someone told me the fastest ULL HFT systems (which we aren't competing with) apparently run at 20-40ns from the first frame of a data packet arriving on the pin, to the last frame of an order request packet being sent. |
I agree, it's better than diving into C memory management. What does the Cython annotation show? It should highlight the parts of the code that are not optimized. The first impression is impressive, but how critical is it for the system? For backtests that require significant calculations, performance is critical, so this is important; but as I understand it, access to the system clock is not required there (time is taken from the tick record). For a live system, if we do not want to compete with HFT, perhaps this difference is not so important, especially when compared with network latencies?
You're absolutely correct. The more important goal is to have backtests run as fast as possible, and yes, the system clock is not required for backtests. Significant time definitely shouldn't be allocated to something which represents 5% of total latency when network latency will make it almost irrelevant. Now that I look at the numbers again, it's probably best to just keep the |
I think yes. I know that working with the system clock is one of the optimization tasks of the broader problem of accelerating random numbers generation, which is solved by something like this - https://github.com/IntelPython/mkl_random |
I am now writing a solution for stream processing based on your aggregators and indicators. I like your approach to organizing the code (I have not finished reading the DDD book yet). I want to try applying Cython's nogil for fast parallel processing; I hope this will be useful not only for me.
Good find, I had a look. Seems to involve installing via conda though. So I had a mess around generating my own stream of 16 random bytes, seeding with system clock time.
then you can get a random byte with It was coming in at ~1.3μs which when the original using the standard So I'll stop optimizing there and leave it as is; it's fast enough. |
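As a rough illustration of the idea (16 random bytes from a generator seeded with system clock time), here is a pure Python sketch. It is not the code from this thread: the name `make_uuid4_factory` is hypothetical, and it swaps in the standard library `random.Random` (Mersenne Twister), which is not cryptographically secure.

```python
import random
import time
import uuid


def make_uuid4_factory(seed=None):
    """Return a function producing version 4 UUIDs from a seeded PRNG.

    Hypothetical helper for illustration only.
    """
    if seed is None:
        seed = time.time_ns()  # seed from the system clock
    rng = random.Random(seed)

    def next_uuid():
        # 128 random bits are 16 random bytes; version=4 makes uuid.UUID
        # overwrite the RFC 4122 version and variant bits.
        return uuid.UUID(int=rng.getrandbits(128), version=4)

    return next_uuid
```

Passing a fixed seed makes the sequence reproducible, which may be handy where backtests need to be deterministic.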
Ok great stuff! I had a look at releasing the GIL and found it too cumbersome, but this was a while ago now. The backtests need to be deterministic too obviously unless you're using a If you can get that working it could speed it up a lot. If it's the Eric Evans book you're referring to, it's really good, although very dense in places. I'd recommend skimming a lot of it to get an overview and deep diving where necessary. There are other DDD books too which are a little more gentle.
One small point I discovered through the above though is that specifying a keyword argument doesn't affect performance at all, Cython just compiles it away. |
If you get anywhere feel free to re-open #179 |
great, sometimes it greatly improves the readability of the code |
A new issue will be opened when performance optimization comes up in a more focused way. |
So I'm just quickly documenting some of my performance profiling, following on from conversations here #190
So if the event loop is warm it takes ~35μs to generate a market order, pass it through the queue in the execution engine and get it to the execution client. If epoll is active and needs to be woken up, it takes 10x that amount. Code can be found in `performance_tests/test_perf_live_execution.py`.
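To give a feel for the warm event loop case, here is a minimal stand-in that times messages passing through an `asyncio.Queue`. It is a sketch only, not the contents of the perf test file above: `measure_queue_latency` is a hypothetical name, and a bare timestamp stands in for the market order.

```python
import asyncio
import time


async def measure_queue_latency(samples=1_000):
    """Return mean put-to-get latency in microseconds over an asyncio.Queue.

    Hypothetical stand-in for the engine queue, for illustration only.
    """
    queue = asyncio.Queue()
    latencies_ns = []

    async def consumer():
        for _ in range(samples):
            sent_ns = await queue.get()
            latencies_ns.append(time.perf_counter_ns() - sent_ns)

    task = asyncio.create_task(consumer())
    for _ in range(samples):
        # The enqueue timestamp plays the role of the order here.
        queue.put_nowait(time.perf_counter_ns())
        await asyncio.sleep(0)  # yield so the consumer can run
    await task
    return sum(latencies_ns) / len(latencies_ns) / 1_000


if __name__ == "__main__":
    print(f"mean latency: {asyncio.run(measure_queue_latency()):.1f}μs")
```

Because the loop never goes idle in this sketch, it only approximates the warm path; the slower case where the selector has to be woken up is not exercised.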
That's a pretty good start. We can't compete in the ULL nanosecond space but that was never the intention for this platform.
A couple of things I can see immediately need improving are UUID and timestamp generation.
Unfortunately the UUID I refactored and C typed from the CPython source isn't much faster than the built-in (see those perf tests), although it's still better to use because it can be C typed. There's a library out there, `fastuuid`, which provides Python bindings for the Rust UUID; however, out of the box that can't be C typed.
For the timestamp there are calls to `clock.utc_now()`, which under the hood makes a pure Python call, `datetime.now(tz=pytz.utc)`. I've been trying to figure out how to get a POSIX timestamp directly off the system clock; there are some includes from Cython but I haven't figured it out as yet.
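For comparison, a pure Python sketch of reading the clock as an integer and only building a `datetime` when one is needed. It assumes Python 3.7+ for `time.time_ns()` and uses the standard library `timezone.utc` rather than `pytz.utc`; the function names are hypothetical and this is not the Cython-level approach being looked for.

```python
import time
from datetime import datetime, timezone


def unix_time_ns():
    """Integer nanoseconds since the Unix epoch, read from the system clock."""
    return time.time_ns()


def utc_datetime_from_ns(ns):
    """Build a timezone-aware UTC datetime from integer nanoseconds.

    Hypothetical helper: lets hot paths carry plain integers and convert
    to datetime only at the edges.
    """
    # Integer arithmetic keeps microsecond precision; float division would not.
    seconds, remainder_ns = divmod(ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=remainder_ns // 1_000
    )
```

Carrying the integer around and converting late avoids constructing a `datetime` object on every call.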