
Technical Implementation Details


This module provides RPC via custom shared memory blocks, synchronised by a hybrid spinlock/named-semaphore scheme. This potentially allows sub-millisecond latencies and high throughput, at the cost of some wasted CPU cycles (up to around 1 millisecond per call).

This module is useful for moving functions and in-memory data into dedicated process(es) rather than duplicating them in every webserver worker process, which can reduce RAM usage. It can also help when the Global Interpreter Lock (GIL) is a limiting factor, as the module can scale worker processes up or down depending on CPU usage over time.

Unlike the Python mmap module, speedysvc does not page written data to a file on disk (and is not copy-on-write), so performance is often not much lower than calling the functions in-process.

It is also intended to allow a separation of concerns, so that larger, complex programs can be split into smaller "blocks" or microservices. Each shared memory client-to-server "connection" allocates a shared memory block, which starts at >=2048 bytes and expands when requests/responses are larger than can be written, growing in power-of-two multiples of the operating system's page size.
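As a rough illustration of that sizing rule, the sketch below computes how large a block would need to grow for a given payload. The function name and constants are illustrative only, not speedysvc's actual internals.

```python
import mmap

MIN_BLOCK_SIZE = 2048  # initial block size mentioned above

def next_block_size(needed_bytes: int) -> int:
    """Return a block size large enough for `needed_bytes`, growing in
    power-of-two multiples of the OS page size (illustrative only)."""
    size = max(MIN_BLOCK_SIZE, mmap.PAGESIZE)
    while size < needed_bytes:
        size *= 2
    return size

# e.g. with a 4096-byte page size:
#   next_block_size(100)    -> 4096
#   next_block_size(5000)   -> 8192
#   next_block_size(70000)  -> 131072
```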

Each client connection needs a single shared memory block and a thread on each worker server. The thread adds some overhead, but I expect this to be low enough for most situations in which I would use this module. Currently only a single connection can be made to a service from each individual process, as the shared memory block is referenced by the client process's PID.

Remote Network Calls

It also allows RPC to be performed over ordinary TCP sockets, using a length-prefixed protocol which sends the length of the data before the data itself to improve buffering performance. This can be many times slower than shared memory, but allows connections to remote hosts.
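A minimal sketch of such length-prefixed framing is shown below; the 4-byte big-endian prefix and the helper names are assumptions for illustration, not necessarily speedysvc's exact wire format.

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send a 4-byte big-endian length prefix, then the payload itself."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> bytes:
    """Read the 4-byte length prefix, then exactly that many payload bytes."""
    header = _recv_exact(sock, 4)
    (length,) = struct.unpack("!I", header)
    return _recv_exact(sock, length)

def _recv_exact(sock: socket.socket, num_bytes: int) -> bytes:
    buf = bytearray()
    while len(buf) < num_bytes:
        chunk = sock.recv(num_bytes - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before full message received")
        buf.extend(chunk)
    return bytes(buf)
```

Because the receiver knows the full message length up front, it can read in exact-sized chunks instead of guessing at buffer boundaries.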

Each server must be given a unique port number and service name. Although the port can be either an integer or bytes for the shared memory server, it is normally best to keep it as a number to remain compatible with network sockets.

Web Management Interface

A management interface (by default on http://127.0.0.1:5155) allows viewing the status of each service defined in the .ini file, along with memory, I/O and CPU usage over time, as well as stdout/stderr logs.

Implementation Considerations for IPC

A common situation with the C implementation of Python (CPython) is being limited by the GIL: a single process cannot use more than one CPU core at a time for pure-Python code. I wanted to separate certain aspects of my software into different processes and call them as if they were local, with as little difference in performance (latency and throughput) as possible.

Different Solutions for Using More CPU Cores

  • Have a single process, and live with only using a single core (or write modules in C/Cython which release the GIL).
  • Have multiple processes, loading the modules and their in-memory data in every process. This can make good use of the CPU, but uses huge amounts of memory if there are more than a few worker processes (in my case many gigabytes). This can get quite expensive on cloud servers where RAM is at a premium, and limits options.
  • Use the multiprocessing module. However, this is mainly useful for communication between the parent process and the child processes managed by multiprocessing itself. It also uses pipe2 for communication, so it can be slower than shared memory, as described below.
  • Still have multiple processes, but move modules into external processes or "microservices" and use inter-process communication (IPC) to reduce wasted RAM and other resources. This is the approach I decided on.

Different kinds of IPC on Linux/Unix

  • Methods which rely on kernel-level synchronisation, such as sockets (Unix domain sockets or TCP sockets), message queues, or pipe/pipe2. These can have high latency, and were limited to roughly 10,000–20,000 requests per second in my benchmarks (a rough benchmark sketch follows this list).
  • Shared memory, which requires synchronisation to be performed manually by the processes themselves, for example with spinlocks, named semaphores or mutexes. This is the approach used by this module.
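For a rough sense of the kernel-synchronised path, the sketch below measures round-trip throughput over a multiprocessing Pipe. The message size and iteration count are arbitrary, and the numbers will vary widely by machine.

```python
import time
from multiprocessing import Pipe, Process

def echo_worker(conn):
    """Echo back every message until the sentinel None arrives."""
    while True:
        msg = conn.recv()
        if msg is None:
            break
        conn.send(msg)

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    worker = Process(target=echo_worker, args=(child_conn,))
    worker.start()

    n = 20_000
    start = time.perf_counter()
    for _ in range(n):
        parent_conn.send(b"x" * 64)   # small payload, round-tripped via the pipe
        parent_conn.recv()
    elapsed = time.perf_counter() - start

    parent_conn.send(None)
    worker.join()
    print(f"{n / elapsed:,.0f} round trips/second "
          f"({elapsed / n * 1e6:.1f} µs per call)")
```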

A spinlock, as the name suggests, "spins": it keeps looping, asking "are you done yet?" until the task is complete. On a single-processor system this slows things down, but on a multi-processor system with pre-emptive multitasking it can be faster if the task completes in less than a process time slice, which is often between 0.75 ms and 6 ms on Linux.

By contrast, mutexes or binary named semaphores avoid wasting CPU cycles, but risk blocking a process while waiting for a task that takes only a fraction of a millisecond. This can increase latency by orders of magnitude for calls that are not CPU- or IO-bound.

Currently, this module is hardcoded to spin for up to 1ms, and thereafter leaves it up to named semaphores to block.
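A minimal sketch of that spin-then-block pattern is shown below. For portability it uses a plain semaphore object rather than a POSIX named semaphore, and the `ready_flag`/`hybrid_wait` names are illustrative rather than speedysvc's actual API.

```python
import time

SPIN_TIMEOUT_S = 0.001  # spin for up to ~1 ms, as described above

def hybrid_wait(ready_flag, semaphore):
    """Wait for the other process to signal completion.

    `ready_flag` is assumed to be a writable buffer over shared memory
    (e.g. multiprocessing.shared_memory.SharedMemory(...).buf) whose first
    byte the other process sets to 1 before releasing `semaphore`.
    """
    deadline = time.monotonic() + SPIN_TIMEOUT_S
    # Fast path: busy-wait, hoping the call finishes within the spin budget.
    while time.monotonic() < deadline:
        if ready_flag[0] == 1:
            return
    # Slow path: give up the CPU and block until the other side releases
    # the semaphore (a named semaphore in speedysvc; any counting semaphore
    # behaves equivalently for this sketch).
    semaphore.acquire()
```

The fast path avoids a context switch for sub-millisecond calls, while the slow path keeps long-running calls from burning CPU indefinitely.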
