# Differentiable bitonic sort

[Bitonic sorts](https://en.wikipedia.org/wiki/Bitonic_sorter) allow creation of sorting networks with a sequence of fixed conditional swapping operations executed in parallel. A sorting network implements a map from $\mathbb{R}^n \rightarrow \mathbb{R}^n$, where $n=2^k$ (sorting networks for non-power-of-2 sizes are possible but trickier).

<img src="BitonicSort1.svg.png">

*[Image: from Wikipedia, by user Bitonic, CC0](https://en.wikipedia.org/wiki/Bitonic_sorter#/media/File:BitonicSort1.svg)*

The sorting network for $n=2^k$ elements has $\frac{k(k+1)}{2}$ "layers" where parallel compare-and-swap operations are used to rearrange an $n$ element vector into sorted order.

## Differentiable compare-and-swap

We can define a `softmax(a,b)` function (not the traditional "softmax" used for classification!) as a continuous approximation to the `max(a,b)` function:

$$\text{softmax}(a,b) = \log(e^a + e^b) \approx \max(a,b).$$

We can then fairly obviously write `softmin(a,b)` as:

$$\text{softmin}(a,b) = -\log(e^{-a} + e^{-b}) \approx \min(a,b).$$

These functions aren't equal to max and min, but they are close whenever $a$ and $b$ are well separated, and they are differentiable everywhere. Note that we now have a differentiable compare-and-swap operation:

$$\text{high} = \text{softmax}(a,b), \text{low} = \text{softmin}(a,b), \text{where } \text{low}\leq \text{high}$$
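
The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code of the `differentiable_sorting` module: the function names mirror the ones used in the text, but the use of `np.logaddexp` (which evaluates $\log(e^a + e^b)$ without overflowing) is our own assumption.

```python
import numpy as np


def softmax(a, b):
    """Smooth approximation to max(a, b): log(exp(a) + exp(b))."""
    # assumption: np.logaddexp is used here for numerical safety
    return np.logaddexp(a, b)


def softmin(a, b):
    """Smooth approximation to min(a, b): -log(exp(-a) + exp(-b))."""
    return -np.logaddexp(-a, -b)


def softcswap(a, b):
    """Differentiable compare-and-swap, returning (low, high)."""
    return softmin(a, b), softmax(a, b)


a = np.array([3.0, -2.0, 10.0])
b = np.array([1.0, 5.0, 10.0])
low, high = softcswap(a, b)
print("low ", low)
print("high", high)
```

A useful property of this pair is that `low + high` equals `a + b` exactly, so the soft compare-and-swap preserves the sum (and hence the mean) of the two inputs.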

## Differentiable sorting

For each layer in the sorting network, we can split all of the pairwise comparison-and-swaps into left-hand and right-hand sides which can be done simultaneously. We can write any function that selects the relevant elements of the vector as a multiply with a binary matrix.

For each layer, we can derive two binary matrices $L \in \mathbb{R}^{\frac{n}{2} \times n}$ and $R \in \mathbb{R}^{\frac{n}{2} \times n}$ which select the elements to be compared for the left and right hands respectively. This will result in the comparison between two $\frac{n}{2}$ length vectors. We can also derive two matrices $L' \in \mathbb{R}^{n \times \frac{n}{2}}$ and $R' \in \mathbb{R}^{n \times \frac{n}{2}}$ which put the results of the compare-and-swap operation back into the right positions.

Then, each layer $i$ of the sorting process is just:
$${\bf x}_{i+1} = L'_i[\text{softmin}(L_i{\bf x}_i, R_i{\bf x}_i)] + R'_i[\text{softmax}(L_i{\bf x}_i, R_i{\bf x}_i)]$$
$$ = L'_i\left(-\log\left(e^{-L_i{\bf x}_i} + e^{-R_i{\bf x}_i}\right)\right) +  R'_i\left(\log\left(e^{L_i{\bf x}_i} + e^{R_i{\bf x}_i}\right)\right)$$
which is clearly differentiable (though not very numerically stable: the exponentials overflow for large inputs, so the usable range of elements $x$ is quite limited in single-precision floating point).
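
One standard way to widen the usable range is to factor the larger argument out of the logarithm, so that only non-positive numbers are ever exponentiated. This is a general identity, offered here as a sketch; nothing in the text says the accompanying code evaluates the functions this way.

$$\text{softmax}(a,b) = \log(e^a + e^b) = \max(a,b) + \log\left(1 + e^{-|a-b|}\right)$$
$$\text{softmin}(a,b) = -\log(e^{-a} + e^{-b}) = \min(a,b) - \log\left(1 + e^{-|a-b|}\right)$$

Written this way, the approximation error is visibly $\log\left(1 + e^{-|a-b|}\right)$, which is largest at $a=b$ (where it equals $\log 2 \approx 0.69$) and decays exponentially as the two inputs move apart.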

All that remains is to compute the matrices $L_i, R_i, L'_i, R'_i$ for each of the layers of the network. 

This approach is computationally wasteful, since each layer multiplies by four mostly-zero matrices, but it is easy to implement. We could also simplify this into two matrix multiplies per layer, at the cost of a vector split and join in the middle (see the `woven` form later in this text). 
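
As a rough sketch of how the per-layer matrices could be built, the code below generates the comparator pairs with the standard iterative bitonic construction and turns each layer into dense 0/1 matrices. The helper names (`bitonic_layers`, `layer_matrices`, `sort_with`) are hypothetical and are not the API of `differentiable_sorting`; we also assume that $L'_i$ and $R'_i$ are simply the transposes of $L_i$ and $R_i$.

```python
import numpy as np


def bitonic_layers(n):
    """Return one list of (low_index, high_index) pairs per layer; n = 2**k."""
    layers = []
    k = 2
    while k <= n:          # size of the bitonic blocks being merged
        j = k // 2
        while j >= 1:      # distance between the compared elements
            pairs = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # blocks alternate between ascending and descending order
                    ascending = (i & k) == 0
                    pairs.append((i, partner) if ascending else (partner, i))
            layers.append(pairs)
            j //= 2
        k *= 2
    return layers


def layer_matrices(n):
    """Build (L, R, L', R') for every layer, with L and R of shape (n/2, n)."""
    matrices = []
    for pairs in bitonic_layers(n):
        L = np.zeros((n // 2, n))
        R = np.zeros((n // 2, n))
        for row, (low, high) in enumerate(pairs):
            L[row, low] = 1.0    # selects the element where the minimum will go
            R[row, high] = 1.0   # selects the element where the maximum will go
        matrices.append((L, R, L.T, R.T))
    return matrices


def hard(a, b):
    return np.minimum(a, b), np.maximum(a, b)


def soft(a, b):
    return -np.logaddexp(-a, -b), np.logaddexp(a, b)


def sort_with(matrices, x, cswap):
    """Apply x <- L' low + R' high for every layer in turn."""
    x = np.asarray(x, dtype=float)
    for L, R, L_back, R_back in matrices:
        low, high = cswap(L @ x, R @ x)
        x = L_back @ low + R_back @ high
    return x


mats = layer_matrices(8)
x = np.array([50.0, -10.0, 30.0, 0.0, 20.0, 40.0, 10.0, -20.0])
print("exact", sort_with(mats, x, hard))
print("soft ", sort_with(mats, x, soft))
```

For $n=8$ this produces six layers. Swapping `hard` for `soft` is the only change needed to go from an exact sorting network to the differentiable one.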

## Example

To sort four elements, we have a network like:

    0  1  2  3  
    ┕>>┙  │  │  
    │  │  ┕<<┙  
    ┕>>>>>┙  │  
    │  ┕>>>>>┙  
    ┕>>┙  │  │  
    │  │  ┕>>┙  
    
This is equivalent to: 

    x[0], x[1] = cswap(x[0], x[1])
    x[3], x[2] = cswap(x[2], x[3])
    x[0], x[2] = cswap(x[0], x[2])
    x[1], x[3] = cswap(x[1], x[3])
    x[0], x[1] = cswap(x[0], x[1])
    x[2], x[3] = cswap(x[2], x[3])
    
where `cswap(a,b) = (min(a,b), max(a,b))`

Replacing the indexing with matrix multiplies and `cswap` with a `softcswap = (softmin(a,b), softmax(a,b))` we then have the differentiable form.



# Test functions

In [3]:
from bitonic_tests import bitonic_network, pretty_bitonic_network

# this should match the diagram at the top of the notebook
bitonic_network(16)

 0>1	 2<3	 4>5	 6<7	 8>9	10<11	12>13	14<15	
----------------------------------------------------------------
 0>2	 1>3	 4<6	 5<7	 8>10	 9>11	12<14	13<15	
 0>1	 2>3	 4<5	 6<7	 8>9	10>11	12<13	14<15	
----------------------------------------------------------------
 0>4	 1>5	 2>6	 3>7	 8<12	 9<13	10<14	11<15	
 0>2	 1>3	 4>6	 5>7	 8<10	 9<11	12<14	13<15	
 0>1	 2>3	 4>5	 6>7	 8<9	10<11	12<13	14<15	
----------------------------------------------------------------
 0>8	 1>9	 2>10	 3>11	 4>12	 5>13	 6>14	 7>15	
 0>4	 1>5	 2>6	 3>7	 8>12	 9>13	10>14	11>15	
 0>2	 1>3	 4>6	 5>7	 8>10	 9>11	12>14	13>15	
 0>1	 2>3	 4>5	 6>7	 8>9	10>11	12>13	14>15	
----------------------------------------------------------------


In [4]:
pretty_bitonic_network(16)

0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 
+>>+  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  +<<+  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  +>>+  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  +<<+  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  +>>+  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  +<<+  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  +>>+  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  │  │  +<<+  
+>>>>>+  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  +<<<<<+  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  +>>>>>+  │  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  │  │  │  │  │  │  +<<<<<+  │  
│  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
+>>+  │  │  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  +>>+  │  │  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  +<<+  │  │  │  │  │  │  │  │  │  │  
│  │  │  │  │  │  +<

# Vectorised functions

## Testing

In [5]:
# Test sorting
import autograd.numpy as np # we can use plain numpy as well


from differentiable_sorting import bitonic_matrices, diff_bisort, diff_rank, softmax, softmin, softcswap, bisort
matrices = bitonic_matrices(8)

for i in range(10):
    # these should all be in sorted order
    test = np.random.randint(0, 200, 8)
    print(bisort(matrices, test))    

[ 54.  56.  69.  70.  90. 134. 177. 186.]
[  5.  23.  80.  87. 117. 122. 149. 199.]
[  9.  33. 101. 114. 142. 154. 167. 171.]
[ 16.  16.  69.  77. 124. 160. 163. 198.]
[  0.  28.  29.  33.  49. 119. 132. 144.]
[ 23.  58.  59.  65.  69. 104. 119. 193.]
[ 21.  45.  48.  51.  77. 153. 159. 167.]
[ 18.  18.  37.  74. 112. 154. 163. 183.]
[ 44.  66.  76.  87. 132. 139. 149. 160.]
[  2.  30.  31.  61.  74. 106. 112. 135.]


In [6]:
for i in range(1, 11):
    k = 2**i
    matrices = bitonic_matrices(k)
    print(f"Testing sorting for {k} elements")
    for j in range(100):
        test = np.random.randint(0, 200, k)
        assert (np.allclose(bisort(matrices, test), np.sort(test)))

Testing sorting for 2 elements
Testing sorting for 4 elements
Testing sorting for 8 elements
Testing sorting for 16 elements
Testing sorting for 32 elements
Testing sorting for 64 elements
Testing sorting for 128 elements
Testing sorting for 256 elements
Testing sorting for 512 elements
Testing sorting for 1024 elements


## Differentiable sorting test

In [7]:
# Differentiable sorting 
np.set_printoptions(precision=2)
matrices = bitonic_matrices(8) 
def neat_vec(n):
    return "\t".join([f"{x:.2f}" for x in n])

for i in range(10):
    test = np.random.randint(-200,200,8)
    print("Differentiable", neat_vec(diff_bisort(matrices, test)))
    print("Exact sorting ", neat_vec(bisort(matrices, test)))
    print()

Differentiable -149.00	-43.00	-7.31	-5.69	36.00	88.00	137.00	186.00
Exact sorting  -149.00	-43.00	-7.00	-6.00	36.00	88.00	137.00	186.00

Differentiable -164.00	-155.00	-9.00	57.00	81.91	85.09	122.00	159.00
Exact sorting  -164.00	-155.00	-9.00	57.00	82.00	85.00	122.00	159.00

Differentiable -102.00	-55.00	2.00	42.00	48.00	119.00	164.00	187.00
Exact sorting  -102.00	-55.00	2.00	42.00	48.00	119.00	164.00	187.00

Differentiable -196.09	-192.91	-114.00	-108.00	-4.00	37.00	63.00	82.00
Exact sorting  -196.00	-193.00	-114.00	-108.00	-4.00	37.00	63.00	82.00

Differentiable -28.00	-19.00	55.00	71.00	111.00	119.00	149.00	160.00
Exact sorting  -28.00	-19.00	55.00	71.00	111.00	119.00	149.00	160.00

Differentiable -200.00	-175.00	-98.00	-89.01	-83.99	11.00	110.77	113.23
Exact sorting  -200.00	-175.00	-98.00	-89.00	-84.00	11.00	111.00	113.00

Differentiable -176.00	-151.00	-73.00	-64.00	8.00	16.00	51.00	143.00
Exact sorting  -176.00	-151.00	-73.00	-64.00	8.00	16.00	51.00	143.00

Differentiable -197.0

# Relaxed sorting
We can define a slightly modified function which interpolates between `softmax(a,b)` and `mean(a,b)`. The result is a sorting function that can be relaxed from sorting to averaging.

In [8]:
from differentiable_sorting import diff_bisort_smooth
# Differentiable smoothed sorting 
test = np.random.randint(-200,200,8)
print(f"Mean {np.mean(test):.2f}")
print()
print("Exact sorting       ", neat_vec(bisort(matrices, test)))
for smooth in np.linspace(0, 1, 8):    
    print(f"Diff. smooth[{smooth:.2f}]  ", neat_vec(diff_bisort_smooth(matrices, test, smooth)))
        

Mean -4.50

Exact sorting        -153.00	-36.00	-28.00	-16.00	-15.00	20.00	65.00	127.00
Diff. smooth[0.00]   -153.00	-36.00	-28.00	-16.31	-14.69	20.00	65.00	127.00
Diff. smooth[0.14]   -91.78	-34.13	-19.37	-14.38	2.65	10.69	36.70	73.61
Diff. smooth[0.29]   -52.21	-27.04	-14.72	-11.28	5.15	7.74	17.98	38.39
Diff. smooth[0.43]   -28.66	-19.06	-13.83	-10.19	2.74	3.47	11.04	18.50
Diff. smooth[0.57]   -15.86	-13.27	-12.81	-11.13	0.85	1.69	6.55	7.98
Diff. smooth[0.71]   -10.48	-10.09	-10.13	-9.73	0.03	0.34	1.86	2.18
Diff. smooth[0.86]   -7.11	-6.96	-6.95	-6.80	-2.23	-2.08	-2.01	-1.86
Diff. smooth[1.00]   -4.50	-4.50	-4.50	-4.50	-4.50	-4.50	-4.50	-4.50
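
One simple way to get this behaviour is a linear blend between the soft compare-and-swap and the pairwise mean. The sketch below is an assumption made for illustration: `relaxed_cswap` is a hypothetical name, and `diff_bisort_smooth` may well interpolate differently.

```python
import numpy as np


def relaxed_cswap(a, b, smooth=0.0):
    """Blend (softmin, softmax) with the pairwise mean.

    smooth=0 gives the soft compare-and-swap; smooth=1 returns the mean twice.
    """
    mean = 0.5 * (a + b)
    soft_low = -np.logaddexp(-a, -b)
    soft_high = np.logaddexp(a, b)
    low = (1.0 - smooth) * soft_low + smooth * mean
    high = (1.0 - smooth) * soft_high + smooth * mean
    return low, high


a = np.array([-16.0, 20.0])
b = np.array([-15.0, 65.0])
for smooth in (0.0, 0.5, 1.0):
    low, high = relaxed_cswap(a, b, smooth)
    print(f"smooth={smooth:.1f}  low={low}  high={high}")
```

With this blend, `low + high` equals `a + b` for every value of `smooth`, so the mean of the vector is preserved at each layer. At `smooth=1` every comparator averages its pair, and passing that through all layers of the bitonic network leaves every output equal to the global mean, as in the last row above.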


In [17]:
from autograd import jacobian
# show that we can take the derivative
jac_sort = jacobian(diff_bisort_smooth, argnum=1)
jac_sort(matrices, test, 0.05) # slight relaxation

array([[0.02, 0.  , 0.  , 0.  , 0.02, 0.03, 0.86, 0.07],
       [0.  , 0.  , 0.02, 0.12, 0.01, 0.09, 0.06, 0.69],
       [0.02, 0.02, 0.02, 0.71, 0.02, 0.09, 0.01, 0.11],
       [0.  , 0.02, 0.  , 0.07, 0.04, 0.74, 0.02, 0.1 ],
       [0.05, 0.86, 0.05, 0.02, 0.  , 0.02, 0.  , 0.  ],
       [0.02, 0.04, 0.86, 0.02, 0.02, 0.  , 0.  , 0.02],
       [0.32, 0.02, 0.02, 0.02, 0.57, 0.03, 0.02, 0.  ],
       [0.57, 0.03, 0.02, 0.02, 0.32, 0.02, 0.02, 0.  ]])

## Woven form
We can "weave" the four matrices into two matrices for fewer multiplies at the cost of having to split and join the vector at each layer.

In [10]:
from differentiable_sorting import bisort_woven_matrices

def diff_bisort_weave(matrices, x):
    """
    Given a set of bitonic sort matrices generated by bisort_woven_matrices(n), sort 
    a sequence x of length n.
    """
    split = len(x) // 2
    for weave, unweave in matrices:
        woven = weave @ x
        x = unweave @ np.concatenate(softcswap(woven[:split], woven[split:]))
    return x


woven_matrices = bisort_woven_matrices(8)

print("Exact sorting       ", neat_vec(bisort(matrices, test)))
print(f"Diff. (std.)       ", neat_vec(diff_bisort(matrices, test)))
print(f"Diff. (woven)      ", neat_vec(diff_bisort_weave(woven_matrices, test)))
        

Exact sorting        -153.00	-36.00	-28.00	-16.00	-15.00	20.00	65.00	127.00
Diff. (std.)        -153.00	-36.00	-28.00	-16.31	-14.69	20.00	65.00	127.00
Diff. (woven)       -153.00	-36.00	-28.00	-16.31	-14.69	20.00	65.00	127.00
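
To make the weaving concrete, here is a minimal sketch for a single layer. It assumes that weaving means stacking $L$ on top of $R$ and placing $L'$ beside $R'$; `weave_pair` is a hypothetical helper, and `bisort_woven_matrices` may construct its matrices differently. An exact (hard) compare-and-swap is used so the result is easy to check by hand.

```python
import numpy as np


def weave_pair(L, R, L_back, R_back):
    """Combine the four per-layer matrices into one (weave, unweave) pair."""
    weave = np.vstack([L, R])              # (n, n): left picks on top, right picks below
    unweave = np.hstack([L_back, R_back])  # (n, n): scatters [low; high] back into place
    return weave, unweave


# a single layer for n = 4 that compares (0, 1) and (2, 3)
L = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
weave, unweave = weave_pair(L, R, L.T, R.T)

x = np.array([30.0, -5.0, 10.0, 20.0])
woven = weave @ x                          # [left elements, right elements]
split = len(x) // 2
low = np.minimum(woven[:split], woven[split:])
high = np.maximum(woven[:split], woven[split:])
result = unweave @ np.concatenate([low, high])
print(result)  # [-5. 30. 10. 20.]
```

Each layer now costs two $n \times n$ multiplies instead of four $\frac{n}{2} \times n$ ones, with the split and concatenate in between.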


## Differentiable ranking
We can use a differentiable similarity measure, e.g. an RBF kernel, between the elements of the input vector and those of its sorted output. Normalising this gives a similarity matrix, which we apply to the vector of ranks `[0, 1, 2, ..., n-1]` to obtain a differentiable ranking function.

As `sigma` gets larger, the result converges to giving all values the mean rank; as it goes to zero the result converges to the true rank.
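
The idea can be sketched as follows. The kernel form, the row normalisation and the name `rbf_order_matrix` are assumptions made for illustration; the actual `order_matrix` may differ in its details. For simplicity `np.sort` stands in for the differentiable sort here.

```python
import numpy as np


def rbf_order_matrix(original, sortd, sigma=0.1):
    """Row-normalised RBF similarity between each input and each sorted value."""
    diff = original[:, None] - sortd[None, :]
    logits = -(diff ** 2) / (2.0 * sigma ** 2)
    # subtract each row's maximum so that small sigma does not produce 0 / 0
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


x = np.array([5.0, -1.0, 9.5, 13.2])
sortd = np.sort(x)  # stand-in for the output of a differentiable sort
for sigma in (0.1, 10.0, 1000.0):
    ranks = rbf_order_matrix(x, sortd, sigma) @ np.arange(len(x))
    print(f"sigma={sigma:7.1f}  |", np.round(ranks, 2))
```

Each row of the similarity matrix is a soft one-hot vector marking where that input landed in the sorted output, so multiplying by the rank vector reads off a smoothed rank.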

In [11]:
from differentiable_sorting import order_matrix, diff_rank

In [12]:
matrices = bitonic_matrices(8)
test = np.random.randint(0, 200, 8)
sortd = diff_bisort(matrices, test)

In [14]:
print("Smoothed ranks")
for sigma in [0.1, 1, 10, 100, 1000]:
    similarity = order_matrix(test, sortd, sigma=sigma)    
    ranks = np.arange(len(test))    
    print(f"sigma={sigma:7.1f}  |", neat_vec(similarity @ ranks))

Smoothed ranks
sigma=    0.1  | 6.00	3.00	5.00	7.00	1.00	2.00	0.00	4.00
sigma=    1.0  | 6.00	3.00	5.16	6.84	1.00	2.00	0.38	3.62
sigma=   10.0  | 5.03	5.10	5.24	5.32	1.00	2.00	1.98	2.02
sigma=  100.0  | 4.09	4.06	4.01	4.00	3.55	3.13	2.65	2.64
sigma= 1000.0  | 3.51	3.51	3.51	3.51	3.50	3.50	3.49	3.49


In [15]:
def diff_rank(matrices, x, sigma=0.1):
    """Return the smoothed, differentiable ranking of each element of x. Sigma
    specifies the smoothing of the ranking. """
    sortd = diff_bisort(matrices, x)
    return order_matrix(x, sortd, sigma=sigma) @ np.arange(len(x))

In [16]:
x = [5.0, -1.0, 9.5, 13.2, 16.2, 20.5, 42.0, 18.0]
np.set_printoptions(suppress=True)
print(neat_vec((diff_rank(matrices, x, sigma=0.1))))

1.00	0.00	2.00	3.00	4.00	7.00	5.00	6.00


In [29]:
np.set_printoptions(precision=8)
jac_rank = jacobian(diff_rank, argnum=1)
print(jac_rank(matrices, test, 0.2) )

[[-0. -0. -0.  0. -0. -0. -0. -0.]
 [-0. -0. -0.  0. -0. -0. -0. -0.]
 [-0. -0. -0. -0. -0.  0. -0. -0.]
 [ 0.  0.  0.  0.  0. -0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0. -0.  0.  0.  0.]
 [-0.  0.  0.  0.  0.  0.  0.  0.]]


# PyTorch example
We can verify that this is both parallelisable on the GPU and fully differentiable.

In [None]:
import torch
import numpy as np
from torch.autograd import Variable
import torch.nn.functional as F
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device:', device)

In [None]:
from differentiable_sorting_torch import diff_bisort, bitonic_matrices, diff_rank
matrices = bitonic_matrices(16)
torch_matrices = [[torch.from_numpy(matrix).float().to(device) for matrix in matrix_set] for matrix_set in matrices]


In [6]:
test_input = np.random.normal(0, 5, 16)
# a plain tensor with requires_grad set replaces the deprecated Variable wrapper
var_test_input = torch.from_numpy(test_input).float().to(device).requires_grad_()

result = diff_bisort(torch_matrices, var_test_input)

# compute the Jacobian of the sorting function, to show we can differentiate through the
# sorting function
jac = []
for i in range(len(result)):
    jac.append(
        torch.autograd.grad(result[i], var_test_input, retain_graph=True)[0])

# 16 x 16 Jacobian of the sorting function
print(torch.stack(jac))

tensor([[2.5131e-03, 5.2202e-05, 6.7652e-05, 1.5480e-02, 1.4476e-04, 5.9317e-05,
         5.3141e-03, 7.3452e-04, 5.8550e-04, 1.5430e-04, 9.6412e-06, 6.1721e-07,
         9.7262e-01, 2.1492e-03, 5.2062e-06, 1.0513e-04],
        [9.0604e-02, 1.8423e-03, 2.3955e-03, 6.6582e-01, 4.7377e-03, 1.9671e-03,
         1.5011e-01, 2.2408e-02, 1.0846e-02, 2.4933e-03, 1.6341e-04, 1.0141e-05,
         2.1643e-02, 2.3574e-02, 6.5396e-05, 1.3220e-03],
        [1.7588e-01, 3.2152e-03, 4.0942e-03, 1.5777e-01, 1.0447e-02, 4.2611e-03,
         4.0763e-01, 5.2987e-02, 5.9876e-02, 1.4155e-02, 8.1245e-04, 4.9536e-05,
         2.8100e-03, 9.8990e-02, 3.7328e-04, 6.6456e-03],
        [2.3339e-01, 4.8711e-03, 6.2478e-03, 1.0802e-01, 1.4679e-02, 5.5586e-03,
         2.1918e-01, 6.2734e-02, 1.0191e-01, 2.0118e-02, 1.3029e-03, 7.9378e-05,
         1.5999e-03, 2.1119e-01, 4.7283e-04, 8.6520e-03],
        [1.6945e-01, 1.7494e-02, 2.2230e-02, 1.4854e-02, 4.5408e-02, 2.0010e-02,
         6.0525e-02, 1.2786e-01, 1.8481

In [9]:
result = diff_rank(torch_matrices, var_test_input)
print(result)


-1


RuntimeError: Device index must not be negative