## Two-level perceptron for reinforcement learning

In [1]:
import numpy as np

In [2]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

### Bit-encoding of input variables $\xi$

The input variables (current position and velocity, and target position and velocity) must be binary-encoded. This is awkward for quantities expressed as floating-point numbers, each of which carries 64 bits of information. The good news is that we only have to encode the information: the network can be trained regardless of whether its internal encoding makes sense to us.

In [52]:
x = np.array(-2.3)

In [79]:
def floats_to_bits(x):
    return np.unpackbits(np.frombuffer(x.tobytes(), dtype=np.uint8))

def bits_to_floats(x):
    return np.frombuffer(np.packbits(x).tobytes(), dtype=np.float64)  # np.float was removed in NumPy 1.24

In [80]:
y = floats_to_bits(np.array([3.2, 2.3]))

In [83]:
print(y, y.shape)

[1 0 0 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
 0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1
 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0] (128,)


In [82]:
bits_to_floats(y)

array([ 3.2,  2.3])
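The bit pattern printed above follows the machine's native byte order, so on a little-endian machine the least significant byte of each number comes first. If a fixed, readable layout (sign bit, then exponent, then mantissa) is preferred, the following sketch forces big-endian encoding. The names `floats_to_bits_be` and `bits_to_floats_be` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def floats_to_bits_be(values):
    """Encode float64 values as bits in big-endian order (sign bit first)."""
    as_be = np.asarray(values, dtype=">f8")  # force big-endian float64
    return np.unpackbits(np.frombuffer(as_be.tobytes(), dtype=np.uint8))

def bits_to_floats_be(bits):
    """Inverse of floats_to_bits_be; returns native-order float64."""
    packed = np.packbits(np.asarray(bits, dtype=np.uint8))
    return np.frombuffer(packed.tobytes(), dtype=">f8").astype(np.float64)

bits = floats_to_bits_be([3.2, 2.3])
print(bits[:12])                # sign bit followed by the 11 exponent bits of 3.2
print(bits_to_floats_be(bits))  # round trip back to the original values
```

The network does not care which layout is used, as long as encoding and decoding agree.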

### Encoding

The following four variables will be encoded in the input vector:

* The three coordinates of origin $\xi_0$
* The three components of the current velocity vector $\vec v_0$
* The three coordinates of the target location $\xi_1$
* The three components of the target velocity vector at the target location $\vec v_1$

This means 12 floating-point numbers of 64 bits each, hence $12 \times 64 = 768$ bits.

Assume that we want to go from (0,0,0) with velocity (0,0,0) to (1,0,0) with velocity (0,0,0). The encoded input is therefore:

In [144]:
xi = floats_to_bits(np.array([0,0,0,0,0,0,1.0,0,0,0,0,0]))

In [145]:
print(len(xi))

768
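To make the ordering of the four vectors explicit, the encoding can be wrapped in a small helper. This is only a sketch: `encode_state` is a hypothetical name, and it assumes the order origin, current velocity, target, target velocity listed above.

```python
import numpy as np

def floats_to_bits(x):
    return np.unpackbits(np.frombuffer(x.tobytes(), dtype=np.uint8))

def encode_state(pos0, vel0, pos1, vel1):
    """Concatenate four 3-vectors and encode them as 12 * 64 = 768 bits."""
    state = np.concatenate([pos0, vel0, pos1, vel1]).astype(np.float64)
    if state.shape != (12,):
        raise ValueError("expected four vectors of length 3")
    return floats_to_bits(state)

xi = encode_state([0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0])
# Only the exponent of the single 1.0 has nonzero bits (ten of them).
print(xi.shape, int(xi.sum()))
```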


At the same time, we expect the output of the neural network to be four variables: the thrust each motor has to apply in each step. Encoded as 64-bit floats, these take 256 output bits.

## Assembling the neural network

The equations for the neural network are the following

$$\begin{align}
z & = sigmoid(p) \\
p & = W_2 \cdot y \\
y & = ReLU(x) \\
x & = W_1 \cdot \xi
\end{align}
$$
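Read from bottom to top, the equations describe one forward pass. A minimal sketch that keeps every intermediate value, since all of them are needed later for the derivatives, could look like this. The function name `forward` and the `0.01` weight scale are assumptions chosen here to keep the sigmoid away from saturation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(W1, W2, xi):
    """Return all intermediate values of the two-layer network."""
    x = W1 @ xi            # hidden pre-activation
    y = np.maximum(x, 0)   # ReLU
    p = W2 @ y             # output pre-activation
    z = sigmoid(p)         # output in (0, 1)
    return x, y, p, z

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 768, 768, 256
W1 = rng.standard_normal((n_hidden, n_in)) * 0.01   # assumed small scale
W2 = rng.standard_normal((n_out, n_hidden)) * 0.01  # assumed small scale
xi = rng.integers(0, 2, size=n_in).astype(np.float64)

x, y, p, z = forward(W1, W2, xi)
print(x.shape, y.shape, p.shape, z.shape)
```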

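To optimize the weights we also need the partial derivatives of the loss function $L = (d-z)^2/2$, where $d$ is the desired output. Applying the chain rule through $z$, $p$, $y$ and $x$, with $\odot$ denoting the elementwise product:

$$\partial_{W_2} L = -\left[(d-z) \odot (1-z) \odot z\right] y^T$$
$$\partial_{W_1} L = -\left[ W_2^T \left((d-z) \odot (1-z) \odot z\right) \odot \partial_x ReLU(x) \right] \xi^T $$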
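A finite-difference check is a cheap way to confirm gradient expressions for this network. The sketch below applies the chain rule to $L = (d-z)^2/2$ step by step (through $z$, $p$, $y$, $x$) and compares the result against central differences on a deliberately tiny network. The sizes, the seed and the helper names `gradients` and `numeric_grad` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(W1, W2, xi, d):
    z = sigmoid(W2 @ np.maximum(W1 @ xi, 0))
    return 0.5 * np.sum((d - z) ** 2)

def gradients(W1, W2, xi, d):
    x = W1 @ xi
    y = np.maximum(x, 0)
    z = sigmoid(W2 @ y)
    delta2 = -(d - z) * z * (1 - z)       # dL/dp
    delta1 = (W2.T @ delta2) * (x > 0)    # dL/dx, ReLU derivative as a mask
    return np.outer(delta1, xi), np.outer(delta2, y)

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences of f() with respect to every entry of W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g

rng = np.random.default_rng(1)
W1 = rng.standard_normal((5, 6))
W2 = rng.standard_normal((3, 5))
xi = np.array([1, 0, 1, 1, 0, 1], dtype=np.float64)
d = np.array([1, 0, 1], dtype=np.float64)

gW1, gW2 = gradients(W1, W2, xi, d)
f = lambda: loss(W1, W2, xi, d)
ok_W1 = np.allclose(gW1, numeric_grad(f, W1), atol=1e-6)
ok_W2 = np.allclose(gW2, numeric_grad(f, W2), atol=1e-6)
print(ok_W1, ok_W2)
```

Note that the gradient with respect to $W_2$ is an outer product with the hidden activation $y$, and both gradients carry the minus sign that comes from differentiating $(d-z)^2/2$ with respect to $z$.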

In [146]:
W1 = np.random.randn(768, 768)

In [147]:
W2 = np.random.randn(256, 768)

In [148]:
x = W1 @ xi.astype(np.float64)  # np.float was removed in NumPy 1.24

In [149]:
x[x < 0] = 0  # ReLU applied in place: from here on, x holds y

In [150]:
p = W2 @ x

In [151]:
z = sigmoid(p)

In [152]:
bits_to_floats(z.astype(np.uint8))

array([  6.88682841e-022,   5.63758775e-231,   2.74596265e-289,
         8.76964545e-193])
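Note that `astype(np.uint8)` truncates toward zero, so every activation below 1.0 becomes a 0 bit. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to threshold the sigmoid output at 0.5 before packing the bits. The name `decode_output` and the synthetic activations are hypothetical.

```python
import numpy as np

def bits_to_floats(bits):
    return np.frombuffer(np.packbits(bits).tobytes(), dtype=np.float64)

def decode_output(z, threshold=0.5):
    """Threshold sigmoid activations into bits and decode them as float64."""
    bits = (np.asarray(z) > threshold).astype(np.uint8)
    return bits_to_floats(bits)

# Synthetic activations: soft versions of the bits of four known values.
target = np.array([1.0, 2.0, 3.0, 4.0])
target_bits = np.unpackbits(np.frombuffer(target.tobytes(), dtype=np.uint8))
z = 0.1 + 0.8 * target_bits   # 0.1 for a 0 bit, 0.9 for a 1 bit

thresholded = decode_output(z)
truncated = bits_to_floats(z.astype(np.uint8))
print(thresholded)  # thresholding recovers the four target values
print(truncated)    # truncation turns every 0.9 into a 0 bit
```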

The goal is to get four values of thrust that make sense. For that, every time we have an output $z$ that can be compared with a desired output $d$ through the loss function, we update the weights with a learning rate $r$:

$$W_2 = W_2 - r \partial_{W_2} L$$

$$W_1 = W_1 - r \partial_{W_1} L$$
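As a rough illustration of these update rules, the sketch below runs repeated gradient-descent steps on a single input/target pair for a small network. The sizes, the learning rate `r = 0.1`, the weight scale and the function name `train_step` are all assumptions picked for the example, and the gradients are obtained by applying the chain rule to $L = (d-z)^2/2$.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_step(W1, W2, xi, d, r):
    """One in-place gradient-descent update; returns the loss before the update."""
    x = W1 @ xi
    y = np.maximum(x, 0)
    z = sigmoid(W2 @ y)
    delta2 = -(d - z) * z * (1 - z)       # dL/dp
    delta1 = (W2.T @ delta2) * (x > 0)    # dL/dx, computed with the old W2
    W2 -= r * np.outer(delta2, y)
    W1 -= r * np.outer(delta1, xi)
    return 0.5 * np.sum((d - z) ** 2)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8)) * 0.1
W2 = rng.standard_normal((4, 16)) * 0.1
xi = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.float64)  # example input bits
d = np.array([1, 0, 0, 1], dtype=np.float64)               # example target bits

losses = [train_step(W1, W2, xi, d, r=0.1) for _ in range(500)]
print(losses[0], losses[-1])
```

On the full 768-bit problem the same loop applies, but the initial weight scale matters: with unit-variance weights the sigmoid saturates and the factor $z(1-z)$ makes the gradients vanish.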

We can start training our system with a known solution, which is stationary flight. This gives an equal thrust for each of the four rotors, $F_i = mg/4$. Assuming a mass of 