Performance Profiling #194

Closed
cjdsellers opened this issue Jan 20, 2021 · 16 comments
@cjdsellers
Member

cjdsellers commented Jan 20, 2021

I'm just quickly documenting some of my performance profiling, following on from the conversation in #190.

If the event loop is warm, it takes ~35μs to generate a market order, pass it through the queue in the execution engine, and get it to the execution client. If epoll is active and needs to be woken up, it takes roughly 10x that amount. The code can be found in performance_tests/test_perf_live_execution.py.

That's a pretty good start. We can't compete in the ULL (ultra-low-latency) nanosecond space, but that was never the intention for this platform.

A couple of things I can see that immediately need improving are UUID and timestamp generation.

Unfortunately, the UUID implementation I refactored and C-typed from the CPython source isn't much faster than the built-in (see those perf tests), although it's still better to use because it can be C-typed. There's a library out there, fastuuid, which provides Python bindings for the Rust uuid crate; however, out of the box that can't be C-typed.

For the timestamp, there are calls to clock.utc_now(), which under the hood makes a pure-Python call to datetime.now(tz=pytz.utc). I've been trying to figure out how to get a POSIX timestamp directly off the system clock; there are some includes available from Cython, but I haven't figured it out as yet.
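As a rough illustration of the cost difference (stdlib only; datetime.timezone.utc is substituted for pytz here, and absolute numbers will vary by machine), the two approaches can be compared with timeit:

```python
import time
import timeit
from datetime import datetime, timezone

def best_ns(stmt, env, runs=10_000):
    # Minimum single-call time in nanoseconds over `runs` runs.
    return min(timeit.repeat(stmt, globals=env, number=1, repeat=runs)) * 1e9

# Full tz-aware datetime construction (what utc_now does under the hood)
dt_ns = best_ns("datetime.now(tz=timezone.utc)",
                {"datetime": datetime, "timezone": timezone})
# Raw POSIX timestamp straight off the system clock
ts_ns = best_ns("time.time()", {"time": time})

print(f"datetime.now(tz=utc): ~{dt_ns:.0f}ns, time.time(): ~{ts_ns:.0f}ns")
```

The raw timestamp read consistently comes in far cheaper than constructing a tz-aware datetime object, which is the gap being chased here.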

@cjdsellers cjdsellers added the enhancement New feature or request label Jan 20, 2021
@cjdsellers
Member Author

cjdsellers commented Jan 20, 2021

This is a good read: https://notes-on-cython.readthedocs.io/en/latest/classes.html

Essentially there isn't a huge difference between cpdef and cdef methods unless the method exists on a subclass. I believe there's a Python-land getattr-type call in that case.

There's plenty of OO going on in the codebase, so there are areas where this occurs. Either the method can be converted immediately to a cdef, or, if Python access needs to be retained (to maintain an API surface, or for testing), the convention I've been using is to create a cdef my_function_c which is called in turn from a cpdef my_function.

Then, where possible, I'll call the faster my_function_c within C-land.

Of course I only justify the extra cruft with some profiling.
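In plain Python syntax the shape of that convention looks like the sketch below (names are illustrative; the actual speedup only materializes when my_function_c is a cdef method compiled by Cython):

```python
# Sketch of the `_c` wrapper convention using hypothetical names.
# Plain Python here only shows the call structure, not the speedup.

class Engine:
    def my_function_c(self, value):
        # cdef in Cython: the fast path, callable from C-land without
        # Python-level attribute lookup on subclassed classes.
        return value * 2

    def my_function(self, value):
        # cpdef in Cython: the Python-facing wrapper, retained for the
        # API surface and for tests; delegates to the fast path.
        return self.my_function_c(value)

engine = Engine()
assert engine.my_function(21) == engine.my_function_c(21) == 42
```

Both entry points return the same result; C-land callers simply skip the wrapper.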

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

So I figured out how to get Unix time directly off the system clock; that is now available via from nautilus_trader.core.time cimport unix_time.

The next step is to figure out how to generate a datetime via C access to the PyDateTimeAPI. So far I haven't been able to figure out the correct objects the function DateTime_FromTimestamp needs. The first argument should be a type, the second an argument tuple containing the double seconds timestamp (and possibly a time zone info object), and the third a kwargs dict (which can be NULL).

Where I was up to with the call is commented out inside the utc_now method of LiveClock. If anyone else wants to have a go, feel free to take a look!
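For reference, the behaviour the C-API call needs to reproduce can be expressed in pure Python; datetime.fromtimestamp with a tzinfo is the Python-level equivalent. A common stumbling block with PyDateTime_FromTimestamp is that it expects an argument tuple, not a bare double:

```python
from datetime import datetime, timezone

# Target behaviour for the C-level call: a double POSIX timestamp in,
# a tz-aware datetime out. At the C level the timestamp (and optional
# tzinfo) must first be packed into an args tuple, e.g. via
# Py_BuildValue("(d)", ts).
unix_ts = 1314761451.0
dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
print(dt.isoformat())  # -> 2011-08-31T03:30:51+00:00
```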

@jpmediadev
Contributor

I found a solution that wraps the C code with a Python API:

#include <stdio.h>
#include <Python.h>
#include <datetime.h>

#define MOD_NAME "cpp"


static double doubleValue = 1314761451;

static PyObject *getDateTimePyFn(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args)) {
    PyObject *floatObj = NULL,
        *timeTuple = NULL,
        *dateTime = NULL;
    floatObj = PyFloat_FromDouble(doubleValue);
    if (!floatObj)
    {
        return NULL;
    }
    timeTuple = Py_BuildValue("(O)", floatObj);
    Py_XDECREF(floatObj);
    if (!timeTuple)
    {
        return NULL;
    }
    dateTime = PyDateTime_FromTimestamp(timeTuple);
    Py_XDECREF(timeTuple);
    return dateTime;
}


static PyMethodDef all_methods[] = {
    { "get_datetime", getDateTimePyFn, METH_VARARGS, NULL },
    { NULL, NULL, 0, NULL }
};


static struct PyModuleDef main_module = {
    PyModuleDef_HEAD_INIT,
    MOD_NAME,
    NULL,
    -1,
    all_methods
};


PyMODINIT_FUNC PyInit_cpp(void) {
    PyDateTime_IMPORT;  // @TODO - cfati: !!! This initializes the struct containing the function pointer !!!
    return PyModule_Create(&main_module);
}
Usage from Python:

import cpp


def main():
    print("cpp.get_datetime returned: {:}".format(cpp.get_datetime()))


if __name__ == "__main__":
    main()

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

There's some good code here. It's a possible solution; however, one thing I've been trying to do is keep all the C code generated, so it's as if Stefan Behnel himself wrote the C code, and memory safety is realistically guaranteed. At present the platform compiles to over 2 million lines of C.

There's a cleanup script in scripts/ which walks all the source directories and ensures only .py, .pyx, and .pxd file types exist, although this could be changed.

Thoughts on the above?

Current performance for clock.utc_now as is:
~0.0ms / ~1.4μs / 1385ns minimum of 100,000 runs @ 1 iteration each run.

Current performance for clock.unix_time is:
~0.0ms / ~0.1μs / 103ns minimum of 100,000 runs @ 1 iteration each run.

So that's more than an order of magnitude difference, granted one is returning a double and the other a PyObject struct. However, I think it's worth pursuing this a bit more.

@cjdsellers
Member Author

UUIDs

Current performance for built-in uuid.uuid4
~0.0ms / ~2.1μs / 2067ns minimum of 100,000 runs @ 1 iteration each run.

Current performance for from nautilus_trader.core.uuid cimport uuid4
~0.0ms / ~1.7μs / 1731ns minimum of 100,000 runs @ 1 iteration each run.
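The built-in measurement can be reproduced with stdlib timeit in the same best-of-N style (absolute numbers will vary by machine):

```python
import timeit
import uuid

# Best-of-N single calls to the built-in uuid.uuid4, reported in the
# same "minimum of N runs @ 1 iteration each run" style as above.
runs = 100_000
best_ns = min(timeit.repeat(uuid.uuid4, number=1, repeat=runs)) * 1e9
print(f"uuid.uuid4: {best_ns:.0f}ns minimum of {runs:,} runs @ 1 iteration each run.")
```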

It would be good to get this down to ~0.1μs, as I think we can with utc_now.

I've previously attempted to access the system's C API for UUIDs, but couldn't get that going at the time.

These two methods are in focus because there are so many calls to them around the codebase.

I'll do another post putting this in context, listing where the time is currently spent when submitting orders.

@cjdsellers
Member Author

Just as a point of interest: someone told me the fastest ULL HFT systems (which we aren't competing with) apparently run at 20-40ns from the first frame of a data packet arriving on the pin to the last frame of an order request packet being sent.

@jpmediadev
Contributor

> …trying to do is keep all C code generated, so it's as if Stefan Behnel himself wrote the C code, and so memory safety is realistically guaranteed.

I agree, it's better than diving into C memory management.

Also, what does the Cython annotation show? It should highlight the parts of the code that are not optimized.

The first impression is impressive, but how critical is this for the system?

For backtests, which require significant computation and where performance is critical, this is important; but as I understand it, access to the system clock is not required there (time is taken from the tick record).

At the same time, for a live system, if we do not want to compete with HFT, perhaps this difference is not so important, especially when compared with network latencies?

@cjdsellers
Member Author

cjdsellers commented Jan 22, 2021

You're absolutely correct. The more important goal is to have backtests run as fast as possible, and yes, the system clock is not required for backtests.

Significant time definitely shouldn't be allocated to something which represents 5% of total latency, when network latency will make it almost irrelevant.

Now that I look at the numbers again, it's probably best to just keep the datetime.now(tz=pytz.utc) call inside utc_now?

@jpmediadev
Contributor

I think yes. I know that working with the system clock is one of the optimization tasks within the broader problem of accelerating random number generation, which is addressed by something like this: https://github.com/IntelPython/mkl_random

@jpmediadev
Contributor

I'm now writing a solution for stream processing based on your aggregators and indicators. I like your approach to organizing the code (I haven't finished reading the DDD book yet). I want to try applying Cython's nogil for fast parallel processing; hopefully this will be useful not only for me.

@cjdsellers
Member Author

> I think yes. I know that working with the system clock is one of the optimization tasks within the broader problem of accelerating random number generation, which is addressed by something like this: https://github.com/IntelPython/mkl_random

Good find, I had a look. It seems to involve installing via conda, though.

So I had a play around generating my own stream of 16 random bytes, seeded with the system clock time.

cdef extern from "stdlib.h":
    void srand(unsigned int seed)
    int rand()

then you can get a random byte with rand() % 256

It was coming in at ~1.3μs which, when the original using the standard os.urandom(16) was only ~1.7μs, just isn't worth it, as I can't guarantee correctness or that the UUID would meet the spec.

So I'll stop optimizing there and leave it as is; it's fast enough.
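For reference, meeting the spec from a raw 16-byte stream only requires forcing two fields, which the stdlib can also do directly via the UUID constructor's version argument:

```python
import os
import uuid

# A version-4 UUID is 16 random bytes with two fields forced per
# RFC 4122: the version nibble (0b0100) and the variant bits (0b10xx).
raw = bytearray(os.urandom(16))
raw[6] = (raw[6] & 0x0F) | 0x40  # set version 4
raw[8] = (raw[8] & 0x3F) | 0x80  # set RFC 4122 variant

u = uuid.UUID(bytes=bytes(raw))
print(u, u.version)  # version is always 4

# Equivalent: the stdlib applies the same bit-twiddling when asked.
u2 = uuid.UUID(bytes=os.urandom(16), version=4)
```

So a custom byte stream could still yield spec-compliant UUIDs, but the masking cost eats into any savings over os.urandom.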

@cjdsellers
Member Author

> I'm now writing a solution for stream processing based on your aggregators and indicators. I like your approach to organizing the code (I haven't finished reading the DDD book yet). I want to try applying Cython's nogil for fast parallel processing; hopefully this will be useful not only for me.

OK, great stuff! I had a look at releasing the GIL and found it too cumbersome, but that was a while ago now. The backtests need to be deterministic too, obviously, unless you're using a FillModel without a random seed.

If you can get that working it could speed things up a lot.

If it's the Eric Evans book you're referring to, it's really good, although very dense in places. I'd recommend skimming a lot of it to get an overview and deep diving where necessary. There are other DDD books too which are a little more gentle.

@cjdsellers
Member Author

One small point I discovered through the above, though, is that specifying a keyword argument doesn't affect performance at all; Cython just compiles it away.

@cjdsellers
Member Author

If you get anywhere, feel free to re-open #179.

@jpmediadev
Contributor

> One small point I discovered through the above though is that specifying a keyword argument doesn't affect performance at all, Cython just compiles it away.

Great, sometimes it greatly improves the readability of the code.

@cjdsellers cjdsellers removed the enhancement New feature or request label Feb 20, 2021
@cjdsellers
Member Author

A new issue will be opened when performance optimization comes up in a more focused way.
