# TensorFlow Ops
A scratchpad for working with low-level TensorFlow functions to determine how best to implement them.

In [1]:
import tensorflow as tf
import numpy as np

## Indexing

In [33]:
a = tf.placeholder(tf.int32, shape=[None, 2])
q_s = tf.placeholder(tf.float32, shape=[None, 4])

In [34]:
q_sa = tf.gather_nd(q_s, a)
sess = tf.Session()

In [35]:
q_s_t = np.random.rand(3, 4)
a_t = [[0, 1], [1, 3], [2, 2]]
print(q_s_t)
print(a_t)
fd = {q_s: q_s_t, a: a_t}
sess.run(q_sa, feed_dict=fd)

[[ 0.35143887  0.22088295  0.78431705  0.22921055]
 [ 0.21140967  0.82402759  0.79057975  0.01022777]
 [ 0.01357729  0.51656414  0.09268521  0.03675706]]
[[0, 1], [1, 3], [2, 2]]


array([ 0.22088295,  0.01022777,  0.09268521], dtype=float32)

In [61]:
a_t = np.asarray([2, 1, 3])
fd = {q_s: q_s_t, a: a_t}
try:
    sess.run(q_sa, feed_dict=fd)
except Exception as e:
    print(e)
    a_t = np.column_stack([np.arange(a_t.shape[0]), a_t])
    fd = {q_s: q_s_t, a: a_t}
    print(a_t)
    print(sess.run(q_sa, feed_dict=fd))

Cannot feed value of shape (3,) for Tensor 'Placeholder_6:0', which has shape '(?, 2)'
[[0 2]
 [1 1]
 [2 3]]
[ 0.78431708  0.8240276   0.03675706]
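
The `np.column_stack` step above pairs each row index with the chosen column so that `tf.gather_nd` can pick one entry per row. As a rough NumPy analog (a sketch only; `select_per_row` is a hypothetical helper name introduced here, not a TensorFlow or NumPy function), the same selection can be written with integer array indexing:

```python
import numpy as np

def select_per_row(values, columns):
    """Pick values[i, columns[i]] for every row i (hypothetical helper)."""
    values = np.asarray(values)
    columns = np.asarray(columns)
    rows = np.arange(values.shape[0])
    # Integer array indexing pairs rows[i] with columns[i], mirroring the
    # [row, column] index pairs fed to tf.gather_nd above.
    return values[rows, columns]

q = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])
picked = select_per_row(q, [2, 1, 3])
print(picked)
```

This is handy for checking the expected result of the graph op on small arrays before feeding real data.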


## Stable softmax (and cross-entropy)
The following was adapted from [this website](http://python.usyiyi.cn/documents/effective-tf/12.html). The final implementation in TensorFlow for the `tf.nn.softmax_cross_entropy_with_logits` function can be found [here](https://github.com/tensorflow/tensorflow/blob/48be6a56d5c49d019ca049f8c48b2df597594343/tensorflow/compiler/tf2xla/kernels/softmax_op.cc).

The softmax operator is given by:

$f(\mathbf{x}) = \dfrac{exp(\mathbf{x})}{\sum_{x_i \in \mathbf{x}} exp(x_i)}$

where $\mathbf{x}$ is a vector composed of components ${x_i \in \mathbf{x}}$. Because the sum of all softmax components reduces to 1.0:

$\sum_{x_i \in \mathbf{x}} f(x_i) = \sum_{x_i \in \mathbf{x}} \dfrac{exp(x_i)}{\sum_{x_i \in \mathbf{x}} exp(x_i)}
= \left ( \dfrac{1}{\sum_{x_i \in \mathbf{x}} exp(x_i)} \right ) \sum_{x_i \in \mathbf{x}} exp(x_i)
= 1.0$

we often like to interpret the softmax output as probabilities, implying that its input $\mathbf{x}$ must represent the log probabilities (up to an additive constant), or logits:

$\dfrac{exp(x_i)}{\sum_{x_i \in \mathbf{x}} exp(x_i)} = p(x_i)$

$exp(x_i) = \left ( p(x_i) \right ) \left ( \sum_{x_i \in \mathbf{x}} exp(x_i) \right )$

$x_i = ln \left( p(x_i) \sum_{x_i \in \mathbf{x}} exp(x_i) \right)
    = ln(p(x_i)) + ln \left( \sum_{x_i \in \mathbf{x}} exp(x_i) \right)
    = ln(p(x_i)) + c$

(Note the addition of the constant $c$ to all logits. This arises because infinitely many logit vectors map to the same probability distribution, since $f(\mathbf{x}+k) = f(\mathbf{x})$ for any constant $k$.)
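
To see why adding a constant to every logit leaves the output unchanged, the factor $exp(k)$ can be pulled out of both the numerator and the denominator. A short derivation, writing $x_j$ for the summation index (a notational choice made here for clarity, not used elsewhere in this notebook):

$\dfrac{exp(x_i + k)}{\sum_{x_j \in \mathbf{x}} exp(x_j + k)}
= \dfrac{exp(k) \, exp(x_i)}{exp(k) \sum_{x_j \in \mathbf{x}} exp(x_j)}
= \dfrac{exp(x_i)}{\sum_{x_j \in \mathbf{x}} exp(x_j)}$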
    
We can easily build a softmax operator in TensorFlow:

In [31]:
def naive_softmax(x):
    # Exponentiate once and reuse the result for the normalizer.
    exp = tf.exp(x)
    z = tf.reduce_sum(exp, axis=1, keep_dims=True)
    return exp / z

In [32]:
tf.reset_default_graph()
sess = tf.Session()
x = np.array([[0.5, 1.0, 3.0], [7.0, -2.0, 1.0]])
sess.run(tf.global_variables_initializer())
out = sess.run(naive_softmax(x))
print(out)
print(np.sum(out, axis=1))

[[  6.74253582e-02   1.11165622e-01   8.21409019e-01]
 [  9.97404592e-01   1.23089505e-04   2.47231880e-03]]
[ 1.  1.]


However, if the logits become too large (or too small), the exponential exceeds the range of the floating-point representation. The softmax then returns 0.0 (when the numerator underflows to $exp(-\infty)=0$ and/or the denominator overflows to $exp(\infty)=\infty$) or `nan` (when the numerator overflows as well, giving $\infty / \infty$).

In [33]:
x = np.array([[0.5, 1.0, 1000.0], [7.0, -1000.0, 1.0]])
out = sess.run(naive_softmax(x))
print(out)
print(np.sum(out, axis=1))

[[ 0.          0.                 nan]
 [ 0.99752738  0.          0.00247262]]
[ nan   1.]
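
The `nan` in the output above comes from $\infty / \infty$. A small NumPy sketch of where the exponential stops being representable (the variable names are illustrative, and the limits are those of IEEE float32 and float64 as exposed by `np.finfo`):

```python
import numpy as np

# Largest x for which exp(x) is still finite in each precision.
limit32 = np.log(np.finfo(np.float32).max)   # roughly 88.7
limit64 = np.log(np.finfo(np.float64).max)   # roughly 709.8

# Silence the overflow / invalid-value warnings that these lines trigger.
with np.errstate(over="ignore", invalid="ignore"):
    big = np.exp(np.float64(1000.0))     # overflows to inf
    small = np.exp(np.float64(-1000.0))  # underflows to 0.0
    ratio = big / big                    # inf / inf is nan, not inf

print(limit32, limit64, big, small, ratio)
```

Any logit above these limits overflows in the naive implementation, regardless of how the other logits in the row look.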


We are usually willing to accept 0.0 in order to avoid `nan`. Subtracting each row's maximum value from its logits makes every shifted logit at most 0, so every exponential lies in $[0, 1]$ and cannot overflow. We can do this due to the property we found above: $f(\mathbf{x}+k)=f(\mathbf{x})$, here with $k=-max(\mathbf{x})$.

In [60]:
def stable_softmax(x):
    max_value = tf.reduce_max(x, axis=1, keep_dims=True)
    x_shifted = x - max_value
    exp = tf.exp(x_shifted)
    z = tf.reduce_sum(exp, axis=1, keep_dims=True)
    return exp / z

In [51]:
x = np.array([[0.5, 1.0, 1000.0], [7.0, -1000.0, 1.0]])
out = sess.run(stable_softmax(x))
print(out)
print(np.sum(out, axis=1))

[[ 0.          0.          1.        ]
 [ 0.99752738  0.          0.00247262]]
[ 1.  1.]


While the softmax function is now stable, a cross-entropy loss built on its output is not. Cross-entropy is defined as:

$g(\mathbf{x}) = \sum_{x_i \in \mathbf{x}}{-p(x_i)ln(p'(x_i))}$

where $p(x)$ is the true (target) probability distribution and $p'(x)$ is the predicted probability distribution. If $p'(x_i)=0$ for any $x_i \in \mathbf{x}$, then the cross-entropy blows up to $\infty$ due to the $ln(0)$ term, or to `nan` when the matching target $p(x_i)$ is also 0, since $0 \cdot (-\infty)$ is undefined.

In [46]:
def naive_cross_entropy(x, y):
    p = stable_softmax(x)
    xent = tf.multiply(y, tf.log(p))
    return -tf.reduce_sum(xent, axis=1)

In [48]:
x_stable = np.array([[0.5, 1.0, 3.0], [7.0, -2.0, 1.0]])
x_unstable = np.array([[0.5, 1.0, 1000.0], [7.0, -1000.0, 1.0]])
y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
out = sess.run([naive_cross_entropy(x_stable, y), naive_cross_entropy(x_unstable, y)])
print(out)

[array([ 2.1967341 ,  0.00259878]), array([ nan,  nan])]
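
The result above is `nan` rather than $\infty$ because the zeros in the one-hot target multiply $ln(0) = -\infty$. A minimal NumPy sketch of the first unstable row, with the probabilities hard-coded to what the stable softmax returns for it:

```python
import numpy as np

p = np.array([0.0, 0.0, 1.0])   # stable softmax of [0.5, 1.0, 1000.0]
y = np.array([0.0, 1.0, 0.0])   # one-hot target from the cell above

# Silence the divide-by-zero / invalid-value warnings these lines trigger.
with np.errstate(divide="ignore", invalid="ignore"):
    log_p = np.log(p)            # [-inf, -inf, 0.0]
    terms = y * log_p            # [nan, -inf, 0.0]: 0 * -inf is nan
    loss = -np.sum(terms)        # nan propagates through the sum

print(log_p, terms, loss)
```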


We can circumvent this error by simply expanding the cross entropy expression:

$g(\mathbf{x}) = \sum_{x_i \in \mathbf{x}}{-p(x_i)ln(p'(x_i))}
= \sum_{x_i \in \mathbf{x}}{-p(x_i) ln \left(\dfrac{e^{x_i}}{\sum_{x_i \in \mathbf{x}}{e^{x_i}}}\right)}
= \sum_{x_i \in \mathbf{x}}{-p(x_i) \left(ln \left(e^{x_i} \right) - ln \left(\sum_{x_i \in \mathbf{x}}{e^{x_i}} \right) \right)}
= \sum_{x_i \in \mathbf{x}}{-p(x_i) \left(x_i - ln \left(\sum_{x_i \in \mathbf{x}}{e^{x_i}} \right) \right)}$

Remember that $\mathbf{x}$ in the expressions above is shifted by $max(\mathbf{x})$. Because this shift avoids $e^{\infty}$, the only remaining hazard is the $ln(0)$ that would occur if every $e^{x_i}$ were 0. That criterion cannot be satisfied: the largest shifted logit is exactly 0 and contributes $e^0 = 1$, so the sum inside the logarithm is always at least 1.

In [58]:
def stable_cross_entropy(x, y):
    max_value = tf.reduce_max(x, axis=1, keep_dims=True)
    x_shifted = x - max_value
    z = tf.reduce_sum(tf.exp(x_shifted), axis=1, keep_dims=True)
    xent = tf.multiply(y, (x_shifted - tf.log(z)))
    return -tf.reduce_sum(xent, axis=1)

In [59]:
out = sess.run([stable_cross_entropy(x_stable, y), stable_cross_entropy(x_unstable, y)])
print(out)

[array([ 2.1967341 ,  0.00259878]), array([  9.99000000e+02,   2.47568514e-03])]
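
As a cross-check of the values above, the same max-shifted expression can be sketched in plain NumPy. This is only an illustrative analog of `stable_cross_entropy`; the `log_softmax` helper is a name introduced here, not a function from the notebook:

```python
import numpy as np

def log_softmax(x):
    """Row-wise log-softmax via the max-shift (log-sum-exp) trick."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - np.max(x, axis=1, keepdims=True)
    # The largest shifted entry is 0, so the sum is at least 1 and log is finite.
    log_z = np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    return shifted - log_z

x_unstable = np.array([[0.5, 1.0, 1000.0], [7.0, -1000.0, 1.0]])
y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
loss = -np.sum(y * log_softmax(x_unstable), axis=1)
print(loss)
```

Working in log space throughout means no probability is ever rounded to 0 before its logarithm is taken.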


Note from the above implementation that $z=\sum_{x_i \in \mathbf{x}}{e^{x_i}}$ is required for the stable cross-entropy implementation, a quantity lost in the softmax output. This is why the stable TensorFlow function `tf.nn.softmax_cross_entropy_with_logits` takes logits, not softmax probabilities, as input, so that it can internally calculate $z$.
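
One way to see the information loss is with two logit vectors that round to the same softmax output yet have different cross-entropy. The NumPy sketch below uses hypothetical helper names (`softmax`, `xent_from_logits`) and made-up logits chosen to force underflow:

```python
import numpy as np

def softmax(x):
    """Max-shifted softmax of a single logit vector."""
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / np.sum(e)

def xent_from_logits(x, y):
    """Cross-entropy computed directly from logits in log space."""
    shifted = x - np.max(x)
    return -np.sum(y * (shifted - np.log(np.sum(np.exp(shifted)))))

logits_a = np.array([0.0, -1000.0])
logits_b = np.array([0.0, -2000.0])
y = np.array([0.0, 1.0])

# Both softmax outputs underflow to exactly [1, 0] ...
same_probs = np.array_equal(softmax(logits_a), softmax(logits_b))
# ... but the losses computed from the logits still differ.
loss_a = xent_from_logits(logits_a, y)
loss_b = xent_from_logits(logits_b, y)
print(same_probs, loss_a, loss_b)
```

Once the probabilities have been rounded, no function of them alone can tell the two cases apart, which is why the loss has to start from the logits.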