# Differentiable bitonic sort

[Bitonic sorts](https://en.wikipedia.org/wiki/Bitonic_sorter) allow creation of sorting networks with a sequence of fixed conditional swapping operations executed in parallel. A sorting network implements a map from $\mathbb{R}^n \rightarrow \mathbb{R}^n$, where $n=2^k$ (sorting networks for non-power-of-2 sizes are possible but trickier).

<img src="BitonicSort1.svg.png">

*[Image: from Wikipedia, by user Bitonic, CC0](https://en.wikipedia.org/wiki/Bitonic_sorter#/media/File:BitonicSort1.svg)*

The sorting network for $n=2^k$ elements has $\frac{k(k+1)}{2}$ "layers" in which parallel compare-and-swap operations rearrange an $n$ element vector into sorted order.
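
As a quick check on the layer count: the merge stage that works on blocks of size $2^j$ contributes $j$ layers, so the total is $1 + 2 + \dots + k$. The helper below is a small illustrative sketch (the name `bitonic_layer_count` is not part of the accompanying modules) and assumes every layer holds exactly $n/2$ compare-and-swaps.

```python
def bitonic_layer_count(n):
    """Number of parallel compare-and-swap layers for n = 2**k inputs."""
    k = n.bit_length() - 1
    if n < 2 or 2 ** k != n:
        raise ValueError("n must be a power of two, at least 2")
    # stage j (block size 2**j) needs j layers: 1 + 2 + ... + k
    return k * (k + 1) // 2

for n in (2, 4, 8, 16):
    layers = bitonic_layer_count(n)
    # each layer pairs up all n elements, i.e. n // 2 compare-and-swaps
    print(f"n={n:2d}: {layers:2d} layers, {layers * n // 2:3d} compare-and-swaps")
```

For $n=16$ this gives 10 layers, which is the number of rows printed by `bitonic_network(16)` further down.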

## Differentiable compare-and-swap

We define the `softmax(a,b)` function (not the traditional "softmax" used for classification!) as a continuous approximation to the `max(a,b)` function:

$$\text{softmax}(a,b) = \log(e^a + e^b) \approx \max(a,b).$$

We can then fairly obviously write `softmin(a,b)` as:

$$\text{softmin}(a,b) = -\log(e^{-a} + e^{-b}) \approx \min(a,b).$$

More numerically stably, we can write:

$$\text{softmin}(a,b) = a + b - \text{softmax}(a,b).$$

These functions obviously aren't equal to max and min, but are relatively close, and differentiable. Note that we now have a differentiable compare-and-swap operation:

$$\text{high} = \text{softmax}(a,b), \text{low} = \text{softmin}(a,b), \text{where } \text{low}\leq \text{high}$$
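
A minimal NumPy sketch of this soft compare-and-swap follows. It uses `np.logaddexp` to evaluate $\log(e^a + e^b)$; the function names mirror the text but are written here for illustration and are not necessarily how the accompanying module defines them.

```python
import numpy as np

def softmax(a, b):
    # log(exp(a) + exp(b)), evaluated without overflowing
    return np.logaddexp(a, b)

def softmin(a, b):
    # uses the identity softmin(a, b) = a + b - softmax(a, b)
    return a + b - softmax(a, b)

def soft_cswap(a, b):
    # returns (low, high), the smooth analogue of (min, max)
    return softmin(a, b), softmax(a, b)

a = np.array([3.0, -1.0, 2.0])
b = np.array([1.0, 4.0, 2.0])
low, high = soft_cswap(a, b)
print("low :", low)
print("high:", high)
```

Two properties are worth noting: `low + high` always equals `a + b`, and `high` overshoots the true maximum by at most $\log 2$ (the worst case is $a = b$).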

Alternatively, we can use: 
$$\text{smoothmax}(a,b) = \frac{a (e^{\alpha a}) + b (e^{\alpha b})}{e^{\alpha a}+e^{\alpha b}}  \approx \max(a,b).$$  This has an adjustable smoothness parameter $\alpha$, with exact maximum as $\alpha \rightarrow \infty$ and pure averaging as $\alpha \rightarrow 0$.
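
The smoothmax formula can be sketched as a weighted average of `a` and `b`. The version below is an illustration rather than the accompanying module's code: it rewrites the weight $e^{\alpha a}/(e^{\alpha a}+e^{\alpha b})$ as a logistic function of $\alpha(a-b)$, computed through `tanh` so that large arguments do not overflow. The `alpha=1.0` default is an arbitrary choice for this example.

```python
import numpy as np

def smoothmax(a, b, alpha=1.0):
    # weight on a: exp(alpha*a) / (exp(alpha*a) + exp(alpha*b))
    #            = sigmoid(alpha * (a - b)) = 0.5 * (1 + tanh(alpha * (a - b) / 2))
    w = 0.5 * (1.0 + np.tanh(0.5 * alpha * (a - b)))
    return w * a + (1.0 - w) * b

def smoothmin(a, b, alpha=1.0):
    # the pair (smoothmin, smoothmax) preserves the sum a + b
    return a + b - smoothmax(a, b, alpha)

for alpha in (0.0, 0.5, 2.0, 50.0):
    print(f"alpha={alpha:5.1f}  smoothmax(1, 3) = {smoothmax(1.0, 3.0, alpha):.4f}")
```

Because the result is a convex combination, it always lies between `a` and `b`, unlike the log-sum-exp form, which lies slightly above the maximum.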

## Differentiable sorting

For each layer in the sorting network, we can split all of the pairwise comparison-and-swaps into left-hand and right-hand sides which can be done simultaneously. We can write any function that selects the relevant elements of the vector as a multiply with a binary matrix.

For each layer, we can derive two binary matrices $L \in \mathbb{R}^{\frac{n}{2} \times n}$ and $R \in \mathbb{R}^{\frac{n}{2} \times n}$ which select the elements to be compared for the left and right hands respectively. This results in a comparison between two vectors of length $\frac{n}{2}$. We can also derive two matrices $L' \in \mathbb{R}^{n \times \frac{n}{2}}$ and $R' \in \mathbb{R}^{n \times \frac{n}{2}}$ which put the results of the compare-and-swap operation back into the right positions.

Then, each layer $i$ of the sorting process is just:
$${\bf x}_{i+1} = L'_i[\text{softmin}(L_i{\bf x}_i, R_i{\bf x}_i)] + R'_i[\text{softmax}(L_i{\bf x}_i, R_i{\bf x}_i)]$$
$$ = L'_i\left(-\log\left(e^{-L_i{\bf x}_i} + e^{-R_i{\bf x}_i}\right)\right) +  R'_i\left(\log\left(e^{L_i{\bf x}_i} + e^{R_i{\bf x}_i}\right)\right)$$
which is clearly differentiable (though not very numerically stable -- the usable range of elements $x$ is quite limited in single float precision).
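
To make the range limitation concrete: in single precision, `exp` overflows once its argument exceeds roughly 88, so the literal log-of-sum-of-exponentials form breaks for quite ordinary values. The snippet below is a small illustration; `np.logaddexp` is one way around the problem and is not necessarily what the accompanying module uses.

```python
import numpy as np

a = np.float32(100.0)
b = np.float32(99.0)

# literal formula: exp(100) does not fit in float32, so the result is inf
with np.errstate(over="ignore"):
    naive = np.log(np.exp(a) + np.exp(b))

# same quantity, computed as max(a, b) + log1p(exp(-|a - b|)) internally
stable = np.logaddexp(a, b)

print("naive :", naive)
print("stable:", stable)
```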

All that remains is to compute the matrices $L_i, R_i, L'_i, R'_i$ for each of the layers of the network. 

This formulation does far more computation than necessary, since each matrix is a sparse binary selection, but it is simple to implement. We could also simplify this into two matrix multiplies per layer, at the cost of a vector split and join in the middle (see the `woven` form later in this text). 
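
One way the four matrices per layer might be built is sketched below. This is an illustration under stated assumptions, not the code behind `bitonic_matrices`: the helper names (`bitonic_layers`, `layer_matrices`, `matrix_sort`) are hypothetical, and the network is generated with the textbook XOR-stride construction of a bitonic sorter.

```python
import numpy as np

def bitonic_layers(n):
    """Yield each layer as a list of (low, high) index pairs.

    After the compare-and-swap, the smaller value goes to `low`
    and the larger value to `high`.
    """
    block = 2
    while block <= n:
        stride = block // 2
        while stride >= 1:
            pairs = []
            for i in range(n):
                j = i ^ stride
                if j > i:
                    # blocks alternate between ascending and descending order
                    pairs.append((i, j) if (i & block) == 0 else (j, i))
            yield pairs
            stride //= 2
        block *= 2

def layer_matrices(n, pairs):
    half = n // 2
    L, R = np.zeros((half, n)), np.zeros((half, n))
    L_back, R_back = np.zeros((n, half)), np.zeros((n, half))
    for row, (low, high) in enumerate(pairs):
        L[row, low] = 1.0        # left operand of comparison `row`
        R[row, high] = 1.0       # right operand of comparison `row`
        L_back[low, row] = 1.0   # the minimum is written back to index `low`
        R_back[high, row] = 1.0  # the maximum is written back to index `high`
    return L, R, L_back, R_back

def matrix_sort(x, softmax=np.maximum):
    x = np.asarray(x, dtype=float)
    n = len(x)
    for pairs in bitonic_layers(n):
        L, R, L_back, R_back = layer_matrices(n, pairs)
        a, b = L @ x, R @ x
        high = softmax(a, b)
        low = a + b - high       # softmin via the sum identity
        x = L_back @ low + R_back @ high
    return x

rng = np.random.default_rng(0)
x = rng.integers(-50, 50, 8)
print("exact:", matrix_sort(x))
print("soft :", matrix_sort(x, softmax=np.logaddexp))
```

In this construction the write-back matrices are simply the transposes of the selection matrices. A real implementation would precompute the matrices once per size rather than rebuilding them on every call.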

## Example

To sort four elements, we have a network like:

    0  1  2  3  
    ┕>>┙  │  │  
    │  │  ┕<<┙  
    ┕>>>>>┙  │  
    │  ┕>>>>>┙  
    ┕>>┙  │  │  
    │  │  ┕>>┙  
    
This is equivalent to: 

    x[0], x[1] = cswap(x[0], x[1])
    x[3], x[2] = cswap(x[2], x[3])
    x[0], x[2] = cswap(x[0], x[2])
    x[1], x[3] = cswap(x[1], x[3])
    x[0], x[1] = cswap(x[0], x[1])
    x[2], x[3] = cswap(x[2], x[3])
    
where `cswap(a,b) = (min(a,b), max(a,b))`

Replacing the indexing with matrix multiplies and `cswap` with a `softcswap = (softmin(a,b), softmax(a,b))` we then have the differentiable form.
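
The four-element case is small enough to write out by hand. The sketch below spells out the full three-layer network for four elements, including the compare-and-swap between elements 1 and 3 in the second layer, and lets the swap function be either exact or soft. The names `sort4` and `soft_cswap` are illustrative only.

```python
import itertools
import numpy as np

def cswap(a, b):
    return min(a, b), max(a, b)

def soft_cswap(a, b):
    high = np.logaddexp(a, b)   # log(exp(a) + exp(b))
    return a + b - high, high

def sort4(x, swap=cswap):
    x = list(x)
    # layer 1: sort the two halves in opposite directions (a bitonic sequence)
    x[0], x[1] = swap(x[0], x[1])
    x[3], x[2] = swap(x[2], x[3])
    # layer 2: compare elements two apart
    x[0], x[2] = swap(x[0], x[2])
    x[1], x[3] = swap(x[1], x[3])
    # layer 3: compare neighbours
    x[0], x[1] = swap(x[0], x[1])
    x[2], x[3] = swap(x[2], x[3])
    return x

# the exact network sorts every ordering of four distinct values
assert all(sort4(p) == [1, 2, 3, 4] for p in itertools.permutations([1, 2, 3, 4]))

print(sort4([30.0, -10.0, 20.0, 0.0], swap=soft_cswap))
```

With well-separated inputs like these, the soft version lands very close to the exactly sorted vector; the gap grows as the values being compared get closer together.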



# Test functions

In [1]:
from bitonic_tests import bitonic_network, pretty_bitonic_network

def neat_vec(n):
    # print a vector neatly    
    return "\t".join([f"{x:.2f}" for x in n])

# this should match the diagram at the top of the notebook
bitonic_network(16)

 0>1	 2<3	 4>5	 6<7	 8>9	10<11	12>13	14<15	
----------------------------------------------------------------
 0>2	 1>3	 4<6	 5<7	 8>10	 9>11	12<14	13<15	
 0>1	 2>3	 4<5	 6<7	 8>9	10>11	12<13	14<15	
----------------------------------------------------------------
 0>4	 1>5	 2>6	 3>7	 8<12	 9<13	10<14	11<15	
 0>2	 1>3	 4>6	 5>7	 8<10	 9<11	12<14	13<15	
 0>1	 2>3	 4>5	 6>7	 8<9	10<11	12<13	14<15	
----------------------------------------------------------------
 0>8	 1>9	 2>10	 3>11	 4>12	 5>13	 6>14	 7>15	
 0>4	 1>5	 2>6	 3>7	 8>12	 9>13	10>14	11>15	
 0>2	 1>3	 4>6	 5>7	 8>10	 9>11	12>14	13>15	
 0>1	 2>3	 4>5	 6>7	 8>9	10>11	12>13	14>15	
----------------------------------------------------------------


In [2]:
pretty_bitonic_network(16)

 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
 ╭──╯  │  │  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  │  ╰──╮  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  ╭──╯  │  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  │  │  ╰──╮  │  │  │  │  │  │  │  │ 
 │  │  │  │  │  │  │  │  ╭──╯  │  │  │  │  │  │ 
 │  │  │  │  │  │  │  │  │  │  ╰──╮  │  │  │  │ 
 │  │  │  │  │  │  │  │  │  │  │  │  ╭──╯  │  │ 
 │  │  │  │  │  │  │  │  │  │  │  │  │  │  ╰──╮ 
 ╭─────╯  │  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  ╭─────╯  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  ╰─────╮  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  │  ╰─────╮  │  │  │  │  │  │  │  │ 
 │  │  │  │  │  │  │  │  ╭─────╯  │  │  │  │  │ 
 │  │  │  │  │  │  │  │  │  ╭─────╯  │  │  │  │ 
 │  │  │  │  │  │  │  │  │  │  │  │  ╰─────╮  │ 
 │  │  │  │  │  │  │  │  │  │  │  │  │  ╰─────╮ 
 ╭──╯  │  │  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  │  ╭──╯  │  │  │  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  ╰──╮  │  │  │  │  │  │  │  │  │  │ 
 │  │  │  │  │  │  ╰

# Vectorised functions

## Testing sorting network

In [3]:
# Test sorting
import autograd.numpy as np # we can use plain numpy as well (but can't take grad!)


from differentiable_sorting import bitonic_matrices, diff_sort, diff_argsort
from differentiable_sorting import softmax, smoothmax, softmax_smooth
matrices = bitonic_matrices(8)

# test bitonic sorting with exact maximum
for i in range(10):
    # these should all be in sorted order
    test = np.random.randint(0, 200, 8)
    print(diff_sort(matrices, test, softmax=np.maximum))    

[ 21.  48.  55.  80. 133. 153. 187. 196.]
[ 22. 108. 135. 145. 149. 160. 164. 191.]
[ 11.  56.  63.  76.  88. 104. 114. 137.]
[ 24.  66. 109. 114. 125. 148. 150. 154.]
[ 25.  39.  54.  60.  68.  72. 123. 141.]
[  7.   7.  31.  51.  97. 116. 161. 162.]
[ 28.  29.  30.  40.  40.  60. 130. 171.]
[ 12.  34.  57.  73. 120. 148. 162. 166.]
[ 41.  52.  55.  61.  86.  91. 174. 177.]
[ 15.  81. 140. 140. 142. 166. 181. 186.]


In [4]:
for i in range(1, 11):
    k = 2**i
    matrices = bitonic_matrices(k)
    print(f"Testing sorting for {k} elements")
    for j in range(100):
        test = np.random.randint(0, 200, k)

        assert (np.allclose(diff_sort(matrices, test, softmax=np.maximum), np.sort(test)))

Testing sorting for 2 elements
Testing sorting for 4 elements
Testing sorting for 8 elements
Testing sorting for 16 elements
Testing sorting for 32 elements
Testing sorting for 64 elements
Testing sorting for 128 elements
Testing sorting for 256 elements
Testing sorting for 512 elements
Testing sorting for 1024 elements


## Differentiable sorting test

In [5]:
# Differentiable sorting 
np.set_printoptions(precision=2)
matrices = bitonic_matrices(8) 


for i in range(10):
    test = np.random.randint(-200,200,8)
    print("Softmax sorting   ", neat_vec(diff_sort(matrices, test, softmax=softmax)))
    print("Smoothmax sorting ", neat_vec(diff_sort(matrices, test, softmax=smoothmax)))
    print("Exact sorting     ", neat_vec(diff_sort(matrices, test, softmax=np.maximum)))
    
    print()

Softmax sorting    -160.00	-110.00	-48.00	-19.00	55.00	87.98	92.02	188.00
Smoothmax sorting  -160.00	-110.00	-48.00	-19.00	55.00	88.07	91.93	188.00
Exact sorting      -160.00	-110.00	-48.00	-19.00	55.00	88.00	92.00	188.00

Softmax sorting    -153.13	-150.87	-142.00	-72.00	-62.00	19.00	151.00	182.00
Smoothmax sorting  -152.76	-151.24	-142.00	-72.00	-62.00	19.00	151.00	182.00
Exact sorting      -153.00	-151.00	-142.00	-72.00	-62.00	19.00	151.00	182.00

Softmax sorting    -198.02	-193.98	-52.00	-45.00	-21.00	8.00	68.00	98.00
Smoothmax sorting  -197.93	-194.07	-51.99	-45.01	-21.00	8.00	68.00	98.00
Exact sorting      -198.00	-194.00	-52.00	-45.00	-21.00	8.00	68.00	98.00

Softmax sorting    -195.00	-100.00	-57.00	-40.00	77.00	86.00	168.00	183.00
Smoothmax sorting  -195.00	-100.00	-57.00	-40.00	77.00	86.00	168.00	183.00
Exact sorting      -195.00	-100.00	-57.00	-40.00	77.00	86.00	168.00	183.00

Softmax sorting    -141.00	-113.00	-106.00	-71.00	-59.00	18.00	31.00	176.00
Smoothmax sorting  -141

# Relaxed sorting
We can define a slightly modified function which interpolates between `softmax(a,b)` and `mean(a,b)`. The result is a sorting function that can be relaxed from sorting to averaging.
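
As an illustration of the idea only: the function `relaxed_max` below is a hypothetical stand-in that blends the two linearly. The library's `softmax_smooth` may well be defined differently, so the numbers it produces need not match the output shown below.

```python
import numpy as np

def relaxed_max(a, b, smooth=0.0):
    # smooth = 0 gives log(exp(a) + exp(b)); smooth = 1 gives the mean
    soft = np.logaddexp(a, b)
    mean = 0.5 * (a + b)
    return (1.0 - smooth) * soft + smooth * mean

def relaxed_cswap(a, b, smooth=0.0):
    high = relaxed_max(a, b, smooth)
    return a + b - high, high

for smooth in (0.0, 0.5, 1.0):
    low, high = relaxed_cswap(-4.0, 6.0, smooth)
    print(f"smooth={smooth:.1f}  low={low:6.3f}  high={high:6.3f}")
```

At `smooth=1` both outputs of the compare-and-swap equal the mean of the pair, so a network built from it averages rather than sorts.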

In [26]:

# Differentiable smoothed sorting 
test = np.random.randint(-200,200,8)
print(f"Mean {np.mean(test):.2f}")
print()
print("Exact sorting            ", neat_vec(diff_sort(matrices, test, np.maximum)))
print()
for smooth in np.linspace(0, 1, 8):    
    print(f"Softmax.   smooth[{smooth:.2f}]  ", neat_vec(diff_sort(matrices, test, lambda a,b:softmax_smooth(a,b,smooth=smooth))))
    # smoothmax's alpha runs opposite to `smooth`, so we pass alpha = 1 - smooth
    print(f"Smoothmax. alpha=[{1-smooth:.2f}]  ", neat_vec(diff_sort(matrices, test, lambda a,b:smoothmax(a,b, alpha=1-smooth))))
    print()

Mean 25.88

Exact sorting             -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.00]   -198.00	-103.00	-94.00	5.00	77.00	122.00	198.31	199.69
Smoothmax. alpha=[1.00]   -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.14]   -114.33	-64.29	-40.38	-8.86	77.52	96.37	124.71	136.26
Smoothmax. alpha=[0.86]   -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.29]   -58.80	-35.31	-7.76	-6.37	68.24	73.85	80.37	92.77
Smoothmax. alpha=[0.71]   -198.00	-102.99	-94.01	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.43]   -22.37	-10.52	-1.25	9.08	48.99	50.76	65.21	67.10
Smoothmax. alpha=[0.57]   -198.00	-102.95	-94.05	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.57]   0.27	5.34	5.70	11.94	40.34	41.29	50.82	51.32
Smoothmax. alpha=[0.43]   -198.00	-102.81	-94.19	5.00	77.00	122.00	199.00	199.00

Softmax.   smooth[0.71]   12.10	13.03	12.44	13.37	37.06	37.41	40.61	40.98
Smoothmax. alpha=[0.29]   -198.00	-102.

In [27]:
from autograd import jacobian
# show that we can take the derivative
jac_sort = jacobian(diff_sort, argnum=1)
jac_sort(matrices, test, softmax=lambda a,b:softmax_smooth(a,b,0.05)) # slight relaxation

array([[0.001, 0.022, 0.001, 0.003, 0.045, 0.044, 0.86 , 0.024],
       [0.003, 0.003, 0.023, 0.078, 0.022, 0.804, 0.043, 0.024],
       [0.022, 0.021, 0.023, 0.803, 0.025, 0.079, 0.004, 0.023],
       [0.023, 0.002, 0.001, 0.021, 0.86 , 0.025, 0.045, 0.024],
       [0.86 , 0.045, 0.045, 0.024, 0.022, 0.001, 0.001, 0.001],
       [0.045, 0.024, 0.86 , 0.024, 0.002, 0.023, 0.001, 0.023],
       [0.003, 0.027, 0.023, 0.023, 0.023, 0.023, 0.023, 0.855],
       [0.044, 0.856, 0.024, 0.024, 0.002, 0.001, 0.023, 0.027]])

In [28]:
# show that we can take the derivative, applying some smoothing to get reasonable values
print(diff_sort(matrices, test, smoothmax))
jac_sort(matrices, test,  smoothmax) 

[-198.    -102.999  -94.001    5.      77.     122.     199.     199.   ]


array([[ 0.   , -0.   ,  0.   ,  0.   , -0.   ,  0.   ,  1.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  1.001, -0.   , -0.001, -0.   ,  0.   ],
       [ 0.   , -0.   ,  0.   , -0.001,  0.   ,  1.001, -0.   ,  0.   ],
       [ 0.   , -0.   ,  0.   ,  0.   ,  1.   , -0.   ,  0.   , -0.   ],
       [ 1.   ,  0.   ,  0.   , -0.   , -0.   ,  0.   ,  0.   ,  0.   ],
       [-0.   ,  0.   ,  1.   ,  0.   ,  0.   , -0.   , -0.   ,  0.   ],
       [ 0.   ,  0.5  , -0.   ,  0.   ,  0.   , -0.   , -0.   ,  0.5  ],
       [ 0.   ,  0.5  , -0.   ,  0.   ,  0.   , -0.   , -0.   ,  0.5  ]])

## Woven form
We can "weave" the four matrices into two matrices for fewer multiplies at the cost of having to split and join the vector at each layer.
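
For a single layer, the weaving idea can be sketched as follows. This is an assumption about the general shape of the technique, not a description of what `bitonic_woven_matrices` returns: the two selection matrices are stacked into one $n \times n$ matrix, and the two write-back matrices are placed side by side in another.

```python
import numpy as np

n, half = 4, 2
pairs = [(0, 2), (1, 3)]   # one layer: compare 0 with 2, and 1 with 3

L, R = np.zeros((half, n)), np.zeros((half, n))
for row, (low, high) in enumerate(pairs):
    L[row, low] = 1.0
    R[row, high] = 1.0

weave = np.vstack([L, R])         # gathers left operands, then right operands
unweave = np.hstack([L.T, R.T])   # scatters minima and maxima back into place

x = np.array([3.0, -1.0, 2.0, 5.0])

# four-matrix form: two selections and two write-backs
a, b = L @ x, R @ x
four = L.T @ np.minimum(a, b) + R.T @ np.maximum(a, b)

# woven form: one multiply, split, compare, join, one multiply
y = weave @ x
a, b = y[:half], y[half:]
woven = unweave @ np.concatenate([np.minimum(a, b), np.maximum(a, b)])

print(four)
print(woven)
```

Both forms produce the same vector; the woven one trades two of the matrix multiplies for a slice and a concatenate.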

In [29]:
from differentiable_sorting import bitonic_woven_matrices, diff_sort_weave

woven_matrices = bitonic_woven_matrices(8)

print("Exact sorting       ", neat_vec(diff_sort(matrices, test, np.maximum)))
print(f"Diff. (std.)       ", neat_vec(diff_sort(matrices, test, smoothmax)))
print(f"Diff. (woven)      ", neat_vec(diff_sort_weave(woven_matrices, test, smoothmax)))
        

Exact sorting        -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00
Diff. (std.)        -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00
Diff. (woven)       -198.00	-103.00	-94.00	5.00	77.00	122.00	199.00	199.00


## Differentiable ranking / argsort
We can use a differentiable similarity measure between the elements of the input vector and the elements of the sorted output, e.g. an RBF kernel. Normalising this gives a similarity matrix, which we apply to the vector of positions `[0, 1, 2, ..., n-1]`. This gives a differentiable ranking function.

As `sigma` gets larger, the result converges to giving all values the mean rank; as it goes to zero the result converges to the true rank.
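
A minimal sketch of this ranking step, assuming a Gaussian (RBF) similarity normalised per input element, is shown here. The function name `soft_rank` is hypothetical, and `np.sort` stands in for the differentiable sort so that the example is self-contained; it is not claimed to reproduce the exact numbers from `diff_argsort`.

```python
import numpy as np

def soft_rank(x, sorted_x, sigma=0.5):
    x = np.asarray(x, dtype=float)
    sorted_x = np.asarray(sorted_x, dtype=float)
    # squared distance from every input element to every sorted element
    sq_dist = (x[:, None] - sorted_x[None, :]) ** 2
    similarity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # normalise each row so it is a weighting over sorted positions
    weights = similarity / similarity.sum(axis=1, keepdims=True)
    # expected position of each input element under that weighting
    return weights @ np.arange(len(x))

x = [5.0, -1.0, 9.5, 13.2, 16.2, 10.5, 42.0, 18.0]
for sigma in (0.1, 1.0, 1000.0):
    print(f"sigma={sigma:7.1f}", np.round(soft_rank(x, np.sort(x), sigma), 2))
```

With a small `sigma` each row of `weights` is nearly one-hot and the result is the true rank; with a very large `sigma` every row is nearly uniform and each element receives the mean rank $(n-1)/2$.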

In [30]:
from differentiable_sorting import order_matrix, diff_argsort

In [31]:
matrices = bitonic_matrices(8)

In [32]:
x = [5.0, -1.0, 9.5, 13.2, 16.2, 10.5, 42.0, 18.0]
np.set_printoptions(suppress=True)
print(x)
# show argsort
ranks = diff_argsort(matrices, x, sigma=0.5)
print(neat_vec(ranks))
print(np.argsort(ranks))

[5.0, -1.0, 9.5, 13.2, 16.2, 10.5, 42.0, 18.0]
1.00	0.00	2.05	4.00	5.00	2.97	7.00	6.00
[1 0 2 5 3 4 7 6]


In [33]:
# ranks[i] is the soft rank of x[i], so the entries closest to 0 and n-1
# mark the positions of the minimum and maximum
print(np.argmin(x), int(np.argmin(ranks)))
print(np.argmax(x), int(np.argmax(ranks)))

1 1
6 6


In [34]:
print("Smoothed ranks")
test = x
for sigma in [0.1, 1, 10, 100, 1000]:     
    ranks = diff_argsort(matrices, test, sigma=sigma) 
    print(f"sigma={sigma:7.1f}  |", neat_vec(ranks))

Smoothed ranks
sigma=    0.1  | 1.00	0.00	2.00	4.00	5.00	3.00	7.00	6.00
sigma=    1.0  | 1.00	0.00	2.33	3.97	5.12	2.73	7.00	5.85
sigma=   10.0  | 2.55	1.92	3.01	3.38	3.65	3.11	6.79	3.82
sigma=  100.0  | 3.47	3.45	3.48	3.49	3.49	3.48	3.56	3.50
sigma= 1000.0  | 3.50	3.50	3.50	3.50	3.50	3.50	3.50	3.50


In [35]:
np.set_printoptions(precision=3)
jac_rank = jacobian(diff_argsort, argnum=1)
print(jac_rank(matrices, np.array(test), 1.0) )

[[ 0.001 -0.    -0.001 -0.    -0.    -0.    -0.    -0.   ]
 [-0.     0.    -0.    -0.    -0.    -0.    -0.    -0.   ]
 [-0.003 -0.     0.233 -0.022 -0.002 -0.206 -0.    -0.   ]
 [-0.001 -0.    -0.031  0.146 -0.033 -0.076 -0.    -0.005]
 [-0.    -0.    -0.002 -0.032  0.224 -0.002 -0.    -0.188]
 [-0.003 -0.    -0.209 -0.066 -0.004  0.283 -0.    -0.001]
 [-0.    -0.    -0.    -0.    -0.    -0.     0.    -0.   ]
 [-0.    -0.    -0.001 -0.013 -0.191 -0.002 -0.     0.207]]


In [36]:
matrices = bitonic_matrices(8)

x = [1, 2, 3, 4, 8, 7, 6, 4]
ranks = diff_argsort(matrices, x, sigma=0.25)
print(neat_vec(ranks))
print(np.argsort(ranks))

print(jac_rank(matrices, np.array(x), 0.25))

0.13	1.09	2.00	3.11	6.99	6.00	5.00	3.11
[0 1 2 3 7 6 5 4]
[[ 2.162 -1.059 -0.523 -0.287 -0.01  -0.018 -0.056 -0.21 ]
 [-0.066  0.562 -0.186 -0.155 -0.005 -0.011 -0.035 -0.105]
 [-0.012 -0.013  0.041 -0.005 -0.    -0.001 -0.002 -0.008]
 [-0.012 -0.025 -0.108  0.564 -0.05  -0.086 -0.141 -0.14 ]
 [-0.001 -0.001 -0.003 -0.005  0.104 -0.058 -0.028 -0.008]
 [-0.    -0.001 -0.002 -0.004 -0.001  0.028 -0.012 -0.007]
 [-0.    -0.    -0.001 -0.002 -0.016 -0.018  0.038 -0.001]
 [-0.012 -0.025 -0.108 -0.209 -0.05  -0.086 -0.141  0.633]]


# PyTorch example
We can verify that this is both parallelisable on the GPU and fully differentiable.

In [81]:
import torch
import numpy as np
from torch.autograd import Variable
import torch.nn.functional as F
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device:', device)

Device: cuda:0


In [82]:
from differentiable_sorting_torch import softmax, diff_argsort
from differentiable_sorting import bitonic_matrices, diff_sort
matrices = bitonic_matrices(16)
torch_matrices = [[torch.from_numpy(matrix).float().to(device) for matrix in matrix_set] for matrix_set in matrices]


In [83]:
test_input = np.random.normal(0, 5, 16)
# torch.autograd.Variable is deprecated; a tensor with requires_grad set does the same job
var_test_input = torch.from_numpy(test_input).float().to(device).requires_grad_()

result = diff_sort(torch_matrices, var_test_input, softmax=softmax)

# compute the Jacobian of the sorting function, to show we can differentiate through the
# sorting function
jac = []
for i in range(len(result)):
    jac.append(
        torch.autograd.grad(result[i], var_test_input, retain_graph=True)[0])

# 16 x 16 Jacobian of the sorting function
print(torch.stack(jac))

tensor([[4.5973e-04, 3.3933e-02, 7.3613e-01, 5.7220e-06, 7.3584e-03, 1.8311e-02,
         1.3363e-02, 2.3068e-03, 1.7365e-01, 5.7429e-03, 7.0045e-06, 2.8817e-05,
         4.1973e-03, 3.6804e-03, 3.1447e-04, 5.0413e-04],
        [4.8785e-04, 4.3964e-02, 1.9875e-01, 7.2420e-06, 1.1447e-02, 3.0540e-02,
         2.0577e-02, 3.3834e-03, 6.5341e-01, 1.7145e-02, 2.0131e-05, 8.2905e-05,
         9.5225e-03, 8.5303e-03, 8.1938e-04, 1.3184e-03],
        [4.2167e-03, 2.7968e-01, 2.7102e-02, 5.4494e-05, 7.8724e-02, 2.0756e-01,
         1.4958e-01, 2.2536e-02, 7.6766e-02, 4.5877e-02, 7.1609e-05, 3.0935e-04,
         5.2884e-02, 4.5629e-02, 3.4320e-03, 5.5709e-03],
        [5.1660e-03, 2.5247e-01, 2.4786e-02, 6.6891e-05, 7.6780e-02, 1.9704e-01,
         1.5687e-01, 2.1487e-02, 5.0643e-02, 5.9974e-02, 7.6715e-05, 3.4016e-04,
         7.7280e-02, 6.5229e-02, 4.5771e-03, 7.2146e-03],
        [1.4384e-02, 9.4579e-02, 3.0093e-03, 1.8639e-04, 1.0764e-01, 1.2278e-01,
         1.3674e-01, 5.6585e-02, 1.1777

In [84]:
result = diff_argsort(torch_matrices, var_test_input)
print(result)


tensor([11.0000,  3.4317,  0.0000, 14.0000,  6.0228,  3.5761,  4.7984,  9.0000,
         1.0000,  5.9649, 14.0000, 13.0000,  5.9053,  7.6375, 11.0000, 10.0000],
       device='cuda:0', grad_fn=<MvBackward>)
