Performance Profiling #194

Closed
cjdsellers opened this issue Jan 20, 2021 · 16 comments
@cjdsellers
Member

cjdsellers commented Jan 20, 2021

I'm just quickly documenting some of my performance profiling, following on from the conversation in #190.

If the event loop is warm, it takes ~35μs to generate a market order, pass it through the queue in the execution engine, and get it to the execution client. If epoll is active and needs to be woken up, it takes roughly 10x that amount. The code can be found in performance_tests/test_perf_live_execution.py.

That's a pretty good start. We can't compete in the ULL (ultra-low-latency) nanosecond space, but that was never the intention for this platform.

A couple of things I can see that immediately need improving are UUID and timestamp generation.

Unfortunately, the UUID implementation I refactored and C-typed from the CPython source isn't much faster than the built-in (see those perf tests), although it's still better to use because it can be C-typed. There's a library out there, fastuuid, which provides Python bindings for the Rust uuid crate; however, out of the box that can't be C-typed.

For the timestamp, there are calls to clock.utc_now(), which under the hood makes a pure-Python call to datetime.now(tz=pytz.utc). I've been trying to figure out how to get a POSIX timestamp directly off the system clock; there are some includes available from Cython, but I haven't figured it out as yet.
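As a rough illustration of the cost difference (stdlib only; datetime.timezone.utc is substituted for pytz here, and absolute numbers will vary by machine), the two approaches can be compared with timeit:

```python
import time
import timeit
from datetime import datetime, timezone

def best_ns(stmt, env, runs=10_000):
    # Minimum single-call time in nanoseconds over `runs` runs.
    return min(timeit.repeat(stmt, globals=env, number=1, repeat=runs)) * 1e9

# Full tz-aware datetime construction (what utc_now does under the hood)
dt_ns = best_ns("datetime.now(tz=timezone.utc)",
                {"datetime": datetime, "timezone": timezone})
# Raw POSIX timestamp straight off the system clock
ts_ns = best_ns("time.time()", {"time": time})

print(f"datetime.now(tz=utc): ~{dt_ns:.0f}ns, time.time(): ~{ts_ns:.0f}ns")
```

The raw timestamp read consistently comes in far cheaper than constructing a tz-aware datetime object, which is the gap being chased here.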

@cjdsellers cjdsellers added the enhancement New feature or request label Jan 20, 2021
@cjdsellers
Member Author

cjdsellers commented Jan 20, 2021

This is a good read: https://notes-on-cython.readthedocs.io/en/latest/classes.html

Essentially there isn't a huge difference between cpdef and cdef methods unless the method exists on a subclass. I believe there's a Python-land getattr-type call in that case.

There's plenty of OO going on in the codebase, so there are areas where this occurs. Either the method can be converted immediately to a cdef, or, if Python access needs to be retained (to maintain an API surface, or for testing), the convention I've been using is to create a cdef my_function_c which is called in turn from a cpdef my_function.

Then, where possible, I'll call the faster my_function_c within C-land.

Of course I only justify the extra cruft with some profiling.
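In plain Python syntax the shape of that convention looks like the sketch below (names are illustrative; the actual speedup only materializes when my_function_c is a cdef method compiled by Cython):

```python
# Sketch of the `_c` wrapper convention using hypothetical names.
# Plain Python here only shows the call structure, not the speedup.

class Engine:
    def my_function_c(self, value):
        # cdef in Cython: the fast path, callable from C-land without
        # Python-level attribute lookup on subclassed classes.
        return value * 2

    def my_function(self, value):
        # cpdef in Cython: the Python-facing wrapper, retained for the
        # API surface and for tests; delegates to the fast path.
        return self.my_function_c(value)

engine = Engine()
assert engine.my_function(21) == engine.my_function_c(21) == 42
```

Both entry points return the same result; C-land callers simply skip the wrapper.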

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

So I figured out how to get Unix time directly off the system clock; that is now available via from nautilus_trader.core.time cimport unix_time.

The next step is to figure out how to generate a datetime via C access to the PyDateTimeAPI. So far I haven't been able to figure out the correct objects the function DateTime_FromTimestamp needs. The first argument should be a type, the second an argument tuple containing the double seconds timestamp (and possibly a time zone info object), and the third a kwargs dict (which can be NULL).

Where I was up to with the call is commented out inside the utc_now method of LiveClock. If anyone else wants to have a go, feel free to take a look!
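For reference, the behaviour the C-API call needs to reproduce can be expressed in pure Python; datetime.fromtimestamp with a tzinfo is the Python-level equivalent. A common stumbling block with PyDateTime_FromTimestamp is that it expects an argument tuple, not a bare double:

```python
from datetime import datetime, timezone

# Target behaviour for the C-level call: a double POSIX timestamp in,
# a tz-aware datetime out. At the C level the timestamp (and optional
# tzinfo) must first be packed into an args tuple, e.g. via
# Py_BuildValue("(d)", ts).
unix_ts = 1314761451.0
dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
print(dt.isoformat())  # -> 2011-08-31T03:30:51+00:00
```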

@jpmediadev
Contributor

I found a solution that wraps the C code with a Python API:

#include <stdio.h>
#include <Python.h>
#include <datetime.h>

#define MOD_NAME "cpp"


static double doubleValue = 1314761451;

static PyObject *getDateTimePyFn(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args)) {
    PyObject *floatObj = NULL,
        *timeTuple = NULL,
        *dateTime = NULL;
    floatObj = PyFloat_FromDouble(doubleValue);
    if (!floatObj)
    {
        return NULL;
    }
    timeTuple = Py_BuildValue("(O)", floatObj);
    Py_XDECREF(floatObj);
    if (!timeTuple)
    {
        return NULL;
    }
    dateTime = PyDateTime_FromTimestamp(timeTuple);
    Py_XDECREF(timeTuple);
    return dateTime;
}


static PyMethodDef all_methods[] = {
    { "get_datetime", getDateTimePyFn, METH_VARARGS, NULL },
    { NULL, NULL, 0, NULL }
};


static struct PyModuleDef main_module = {
    PyModuleDef_HEAD_INIT,
    MOD_NAME,
    NULL,
    -1,
    all_methods
};


PyMODINIT_FUNC PyInit_cpp(void) {
    PyDateTime_IMPORT;  // @TODO - cfati: !!! This initializes the struct containing the function pointer !!!
    return PyModule_Create(&main_module);
}
Usage from Python:

import cpp


def main():
    print("cpp.get_datetime returned: {:}".format(cpp.get_datetime()))


if __name__ == "__main__":
    main()

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

There's some good code here. It's a possible solution; however, one thing I've been trying to do is keep all the C code generated, so it's as if Stefan Behnel himself wrote the C code, and memory safety is realistically guaranteed. At present the platform compiles to over 2 million lines of C.

There's a cleanup script in scripts/ which walks all the source directories and ensures only .py, .pyx, and .pxd file types exist, although this could be changed.

Thoughts on the above?

Current performance for clock.utc_now as is:
~0.0ms / ~1.4μs / 1385ns minimum of 100,000 runs @ 1 iteration each run.

Current performance for clock.unix_time is:
~0.0ms / ~0.1μs / 103ns minimum of 100,000 runs @ 1 iteration each run.

So that's more than an order of magnitude difference, granted one is returning a double and the other a PyObject struct. However, I think it's worth pursuing this a bit more.

@cjdsellers
Member Author

UUIDs

Current performance for built-in uuid.uuid4
~0.0ms / ~2.1μs / 2067ns minimum of 100,000 runs @ 1 iteration each run.

Current performance for from nautilus_trader.core.uuid cimport uuid4
~0.0ms / ~1.7μs / 1731ns minimum of 100,000 runs @ 1 iteration each run.
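The built-in measurement can be reproduced with stdlib timeit in the same best-of-N style (absolute numbers will vary by machine):

```python
import timeit
import uuid

# Best-of-N single calls to the built-in uuid.uuid4, reported in the
# same "minimum of N runs @ 1 iteration each run" style as above.
runs = 100_000
best_ns = min(timeit.repeat(uuid.uuid4, number=1, repeat=runs)) * 1e9
print(f"uuid.uuid4: {best_ns:.0f}ns minimum of {runs:,} runs @ 1 iteration each run.")
```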

It would be good to get this down to ~0.1μs, as I think we can with utc_now.

I've previously attempted to access the system's C API for UUIDs, but couldn't get that going at the time.

These two methods are in focus because there are so many calls to them around the codebase.

I'll do another post putting this in context, listing where the time is currently spent when submitting orders.

@cjdsellers
Member Author

Just as a point of interest: someone told me the fastest ULL HFT systems (which we aren't competing with) apparently run at 20-40ns from the first frame of a data packet arriving on the pin to the last frame of an order request packet being sent.

@jpmediadev
Contributor

> …trying to do is keep all C code generated, so it's as if Stefan Behnel himself wrote the C code, and so memory safety is realistically guaranteed.

I agree, it's better than diving into C memory management.

Also, what does the Cython annotation show? It should highlight the parts of the code that are not optimized.

The first impression is impressive, but how critical is this for the system?

For backtests, which require significant computation and where performance is critical, this is important; but as I understand it, access to the system clock is not required there (time is taken from the tick record).

At the same time, for a live system, if we do not want to compete with HFT, perhaps this difference is not so important, especially when compared with network latencies?

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

You're absolutely correct. The more important goal is to have backtests run as fast as possible, and yes, the system clock is not required for backtests.

Significant time definitely shouldn't be allocated to something which represents 5% of total latency, when network latency will make it almost irrelevant.

Now that I look at the numbers again, it's probably best to just keep the datetime.now(tz=pytz.utc) call inside utc_now?

@jpmediadev
Contributor

I think yes. I know that working with the system clock is one of the optimization tasks within the broader problem of accelerating random number generation, which is addressed by something like this: https://github.com/IntelPython/mkl_random

@jpmediadev
Contributor

I'm now writing a solution for stream processing based on your aggregators and indicators. I like your approach to organizing the code (I haven't finished reading the DDD book yet). I want to try applying Cython's nogil for fast parallel processing; hopefully this will be useful not only for me.

@cjdsellers
Member Author

> I think yes. I know that working with the system clock is one of the optimization tasks within the broader problem of accelerating random number generation, which is addressed by something like this: https://github.com/IntelPython/mkl_random

Good find, I had a look. It seems to involve installing via conda, though.

So I had a play around generating my own stream of 16 random bytes, seeded with the system clock time.

cdef extern from "stdlib.h":
    void srand(unsigned int seed)
    int rand()

then you can get a random byte with rand() % 256

It was coming in at ~1.3μs which, when the original using the standard os.urandom(16) was only ~1.7μs, just isn't worth it, as I can't guarantee correctness or that the UUID would meet the spec.

So I'll stop optimizing there and leave it as is; it's fast enough.
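For reference, meeting the spec from a raw 16-byte stream only requires forcing two fields, which the stdlib can also do directly via the UUID constructor's version argument:

```python
import os
import uuid

# A version-4 UUID is 16 random bytes with two fields forced per
# RFC 4122: the version nibble (0b0100) and the variant bits (0b10xx).
raw = bytearray(os.urandom(16))
raw[6] = (raw[6] & 0x0F) | 0x40  # set version 4
raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC 4122 variant

u = uuid.UUID(bytes=bytes(raw))
print(u, u.version)  # version is always 4

# Equivalent: the stdlib applies the same bit-twiddling when asked.
u2 = uuid.UUID(bytes=os.urandom(16), version=4)
```

So a custom byte stream could still yield spec-compliant UUIDs, but the masking cost eats into any savings over os.urandom.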

@cjdsellers
Member Author

> I'm now writing a solution for stream processing based on your aggregators and indicators. I like your approach to organizing the code (I haven't finished reading the DDD book yet). I want to try applying Cython's nogil for fast parallel processing; hopefully this will be useful not only for me.

OK, great stuff! I had a look at releasing the GIL and found it too cumbersome, but that was a while ago now. The backtests need to be deterministic too, obviously, unless you're using a FillModel without a random seed.

If you can get that working it could speed things up a lot.

If it's the Eric Evans book you're referring to, it's really good, although very dense in places. I'd recommend skimming a lot of it to get an overview and deep diving where necessary. There are other DDD books too which are a little more gentle.

@cjdsellers
Member Author

One small point I discovered through the above, though, is that specifying a keyword argument doesn't affect performance at all; Cython just compiles it away.

@cjdsellers
Member Author

If you get anywhere, feel free to re-open #179.

@jpmediadev
Contributor

> One small point I discovered through the above though is that specifying a keyword argument doesn't affect performance at all, Cython just compiles it away.

Great, sometimes it greatly improves the readability of the code.

@cjdsellers cjdsellers removed the enhancement New feature or request label Feb 20, 2021
@cjdsellers
Member Author

A new issue will be opened when performance optimization comes up in a more focused way.
