## UCX-Py Example

This example demonstrates how to use the high-level UCX-Py API, which provides asynchronous I/O capability. This is the API used by high-level Python libraries and frameworks, such as Dask.

The high-level API manages the UCP context and worker automatically; multithreading is currently not supported. Creating UCP listeners and endpoints is the user's responsibility, via `ucp.create_listener` and `ucp.create_endpoint`, respectively.

Progressing the UCP worker occurs automatically and is transparent to the user. Upon initialization, UCX-Py creates an asynchronous progress task that is executed continuously on each iteration of Python's asyncio event loop. Two progress modes are supported:

1. Blocking (default): the UCP worker provides a file descriptor that can be awaited. Each time the file descriptor is awakened by the worker, a new async task is registered in the event loop to progress the worker and rearm it when completed;
1. Non-blocking (enabled by setting the `UCXPY_NON_BLOCKING_MODE=1` environment variable): an async task is registered in the event loop to progress the worker and re-register itself. Upon each event loop iteration the worker is progressed, whether or not there is work to progress.
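As a minimal sketch, non-blocking mode can be selected from Python rather than from the shell. This assumes UCX-Py reads the variable when it initializes, so the conservative choice is to set it before `ucp` is imported:

```python
import os

# Assumption: UCX-Py checks this variable at initialization time, so it is
# set before `ucp` is imported or used anywhere in the process.
os.environ["UCXPY_NON_BLOCKING_MODE"] = "1"

# import ucp  # import UCX-Py only after the variable is in place
```

Leaving the variable unset keeps the default blocking mode.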

This example uses the UCX Tag API: `ep.send()` is an interface for `ucp_tag_send_nb`, and `ep.recv()` is an interface for `ucp_tag_recv_nb`. Tags are assigned automatically between the client and listener during endpoint establishment. The pre-established tag can be overridden with the `tag` and `force_tag` arguments of `ep.send()`/`ep.recv()`.
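The tag override could be wrapped as in the sketch below. The helper names `send_with_tag` and `recv_with_tag` are hypothetical and not part of UCX-Py; `ep` is assumed to be an endpoint whose `send()`/`recv()` accept the `tag` and `force_tag` arguments mentioned above, and `force_tag=True` is assumed to make the given integer tag be used as-is.

```python
async def send_with_tag(ep, buffer, tag):
    """Send `buffer` on `ep` using an explicit tag instead of the automatic one."""
    # force_tag=True: assumed to bypass the tag agreed on at endpoint setup
    await ep.send(buffer, tag=tag, force_tag=True)


async def recv_with_tag(ep, buffer, tag):
    """Receive into `buffer` on `ep`, matching only messages sent with `tag`."""
    await ep.recv(buffer, tag=tag, force_tag=True)
```

Both sides have to agree on the tag value, otherwise the receive never matches the send.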

## Import Dependencies

1. `asyncio`: Python's asynchronous I/O library
1. `ucp`: Asynchronous UCX-Py interface for UCX
1. `numpy`: NumPy library providing fundamental scientific computing for Python
    1. `cupy`: CuPy is a CUDA-enabled implementation of the NumPy API. Replacing `numpy` with `cupy` uses CUDA-based arrays, which are transferred via optimal hardware interconnects when available, such as NVLink (CUDA IPC) or InfiniBand (GPUDirect RDMA)

In [1]:
import asyncio
import ucp
import numpy as np

## Listener (aka, server)

### Define listener port

`ListenerPort` is the port the listener binds to; the client subsequently uses it to connect.

In [2]:
ListenerPort = 12345

### Define a listener callback

In this example the callback `listener_callback` will execute the following operations:

1. Allocate single-element NumPy 64-bit unsigned integer array
1. Receive message `msg_size` containing the message size `N` (in bytes)
1. Allocate NumPy 8-bit unsigned integer array of `N` elements (i.e., bytes)
1. Receive message `msg` containing the data
1. Print message received
1. Increment all elements of the received message, storing the result as NumPy array `reply`
1. Send `reply` back to client

To finalize, the content of the received NumPy array will be printed.

In [3]:
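async def listener_callback(ep):
    # Receive the size (in bytes) of the incoming message
    msg_size = np.empty(1, dtype=np.uint64)
    await ep.recv(msg_size)

    # Allocate a buffer of that size and receive the message itself
    msg = np.empty(int(msg_size[0]), dtype="u1")
    await ep.recv(msg)

    print(f"Listener received {msg_size[0]} bytes: {msg}")

    # Increment every element and send the result back to the client
    reply = msg + 1
    await ep.send(reply)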
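The size-then-data pattern used by the callback can be factored into a pair of helpers, sketched below. `send_sized` and `recv_sized` are hypothetical names, not part of UCX-Py; they only assume an endpoint `ep` with awaitable `send()` and `recv()` methods, as used in this example.

```python
import numpy as np


async def send_sized(ep, data):
    """Send a uint8 array preceded by a message holding its size in bytes."""
    size = np.array([data.nbytes], dtype=np.uint64)
    await ep.send(size)
    await ep.send(data)


async def recv_sized(ep):
    """Receive a size message, then a uint8 array of that many bytes."""
    size = np.empty(1, dtype=np.uint64)
    await ep.recv(size)
    data = np.empty(int(size[0]), dtype="u1")
    await ep.recv(data)
    return data
```

With these helpers the receiver does not need to know the message length in advance, at the cost of one extra small message per transfer.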

### Create Listener

Create a UCX-Py listener, which binds to port `ListenerPort` and calls the `listener_callback` callback for each new incoming client connection.

In [4]:
listener = ucp.create_listener(listener_callback, ListenerPort)

## Client

### Create Client Endpoint

In [5]:
host = ucp.get_address(ifname='enp1s0f0')  # Ethernet device name; adjust to match your system
ep = await ucp.create_endpoint(host, ListenerPort)

### Exchange messages

1. Allocate NumPy 8-bit unsigned integer array `msg` of `N` elements (i.e., bytes) populated with 0s
1. Allocate unpopulated NumPy 8-bit unsigned integer array `reply` of `N` elements (i.e., bytes)
1. Allocate single-element NumPy 64-bit unsigned integer array `msg_size` populated with the size `N` (in bytes) of the message to send/receive
1. Send message `msg_size` containing the message size `N` (in bytes)
1. Send message `msg` containing the data
1. Receive `reply` back from listener
1. Print message received
1. Assert `reply` result is correct: `reply == (msg + 1)`

In [6]:
n_bytes = 10**9

msg = np.zeros(n_bytes, dtype='u1')
reply = np.empty(n_bytes, dtype='u1')
msg_size = np.array([msg.nbytes], dtype=np.uint64)

await ep.send(msg_size)
await ep.send(msg)

await ep.recv(reply)

print(f"Client received {msg_size[0]} bytes: {reply}")

np.testing.assert_array_equal(reply, msg + 1)

Listener received 1000000000 bytes: [0 0 0 ... 0 0 0]
Client received 1000000000 bytes: [1 1 1 ... 1 1 1]


## Close Listener

Explicitly close the listener; otherwise it is closed when it goes out of scope. The endpoint has already been closed, since the connection is terminated when `listener_callback` returns.

In [7]:
listener.close()
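To make sure the listener is closed even if the client code raises, the explicit call can be replaced by a context manager. The sketch below uses the standard library's `contextlib.closing`, which only requires an object with a `close()` method; `serve_until_done` and `client_work` are hypothetical names introduced for illustration.

```python
from contextlib import closing


async def serve_until_done(listener, client_work):
    """Run the `client_work` coroutine function, then close `listener`."""
    # closing() calls listener.close() on exit, including when an
    # exception propagates out of client_work().
    with closing(listener):
        return await client_work()
```

Here `client_work` would be a coroutine function containing the endpoint creation and message exchange shown earlier.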