Here are some utils for using ideas from "Categorical Reparameterization with Gumbel-Softmax" (Eric Jang, Shixiang Gu, Ben Poole)
https://arxiv.org/abs/1611.01144.  

The paper deals with the problem of using a stochastic categorical node in a neural network. Suppose we have a latent categorical variable Z $\in$ $R^{C}$, where C is the number of categories. In the usual setting, Z takes one-hot values with probabilities $\pi_{1}$, .., $\pi_{C}$. The problem is that sampling from such a node is not differentiable, and it is not obvious how to reparameterize it so that gradients can flow through it. The paper proposes the Gumbel-Softmax distribution, a continuous relaxation of the categorical distribution, for this purpose.

In the variational autoencoder setting we have an encoder that models the variational distribution. For a categorical random variable, it outputs the probabilities $\pi_{1}$, $\pi_{2}$, .., $\pi_{C}$. Now we want to sample from this distribution in a way that stays differentiable with respect to those probabilities. The details are in the paper.
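
The sampling step builds on the Gumbel-max trick from the paper: adding independent Gumbel(0, 1) noise to $log(\pi_{i})$ and taking the argmax yields a category distributed according to $\pi$. The sketch below checks this empirically in plain Python. It is only an illustration and assumes every $\pi_{i}$ is strictly positive; the helper names `sample_gumbel` and `gumbel_max_sample` are made up for this example and are not part of the utilities in this notebook.

```python
import math
import random
from collections import Counter


def sample_gumbel(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    # Clamp away from 0 so that log(u) stays finite.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(pi, rng: random.Random) -> int:
    """Return a category index drawn according to pi (all pi_i > 0 assumed)."""
    scores = [math.log(p) + sample_gumbel(rng) for p in pi]
    # The index of the largest perturbed log-probability is the sample.
    return max(range(len(pi)), key=scores.__getitem__)


rng = random.Random(0)
pi = [0.2, 0.5, 0.3]
n_draws = 20000

counts = Counter(gumbel_max_sample(pi, rng) for _ in range(n_draws))
frequencies = [counts[i] / n_draws for i in range(len(pi))]
print(frequencies)  # empirical frequencies, expected to be close to pi
```

The argmax itself has no useful gradient, which is why the Gumbel-Softmax approach replaces it with a softmax controlled by a temperature.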

Fitting the whole model comes down to maximizing a lower bound on the log-likelihood, the ELBO (equivalently, minimizing the negative ELBO): <br>
ELBO = $E_{Y \sim q_{\phi}(y|x)}$[log($p_{\theta}$(x, y)) - log($q_{\phi}(y|x)$)] = $E_{Y \sim q_{\phi}(y|x)}$[log($p_{\theta}$(x|y))] - KL($q_{\phi}(y|x)$||$p_{\theta}$(y)) <br>
With a uniform prior over the C categories, $p_{\theta}$(y) = $\frac{1}{C}$, the Kullback-Leibler divergence can be written as: <br>
KL($q_{\phi}(y|x)$||$p_{\theta}$(y)) = $\sum_{i=1}^{C}$ $\pi_{i}(log(\pi_{i}) - log(\frac{1}{C}))$
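
As a quick sanity check of this formula, the plain-Python sketch below evaluates it for a few hand-picked distributions. Expanding the sum gives KL = log(C) - H($\pi$), where H is the entropy, so the divergence should be 0 for a uniform $\pi$ and log(C) for a one-hot $\pi$. The function names here are illustrative only, and zero-probability terms are skipped under the usual convention that 0 * log(0) = 0.

```python
import math


def kl_to_uniform(pi):
    """KL(q || uniform) for a categorical q with probabilities pi."""
    n_classes = len(pi)
    # Terms with pi_i == 0 contribute 0 (limit of x * log(x) as x -> 0).
    return sum(p * (math.log(p) - math.log(1.0 / n_classes)) for p in pi if p > 0)


def entropy(pi):
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
skewed = [0.7, 0.1, 0.1, 0.1]

kl_uniform = kl_to_uniform(uniform)   # matches the prior exactly
kl_one_hot = kl_to_uniform(one_hot)   # furthest possible from the prior
kl_skewed = kl_to_uniform(skewed)     # somewhere in between

print(kl_uniform, kl_one_hot, kl_skewed)
```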

In [2]:
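import numpy as np  # used by kl_categorical (np.log)
import keras
import keras.backend as K  # used by sample_categorical (K.shape)
import keras.layers as L
import tensorflow as tf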

In [3]:
def kl_categorical(pi: tf.Tensor, n_classes: int) -> tf.Tensor:
    """
    computes Kullback-Leibler divergence between variational output and 
    uniform prior on categorical variable
    
    parameters
    ----------
    pi: is assumed to have rows whose sum is equal to 1 (output of softmax)
    n_classes: dimension of categorical variable
    """
    return tf.reduce_sum(pi * (tf.log(pi) - np.log(1 / n_classes)), axis=1) 

def sample_categorical(pi: tf.Tensor, temperature: float) -> tf.Tensor:
    """
    samples from categorical distribution using gumbel-softmax trick.
    
    parameters
    ----------
    pi: is assumed to have rows whose sum is equal to 1 (output of softmax)
    temperature: the lower it is, the closer the samples are to one-hot
    vectors. It must be greater than 0.
    """
    tensor_shape = K.shape(pi)
    batch_size, latent_dim = tensor_shape[0], tensor_shape[1]
    
    u = tf.random_uniform((batch_size, latent_dim), minval=0, maxval=1)
    u = tf.clip_by_value(u, clip_value_min=1e-09, clip_value_max=1-1e-09)
    g = -tf.log(-tf.log(u))
    
    # softmax over the last axis is mathematically the same as exp(...) / sum(exp(...)),
    # but avoids overflow in exp when the temperature is small
    return tf.nn.softmax((tf.log(pi) + g) / temperature)
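
To illustrate the role of `temperature`, here is a small plain-Python sketch that is independent of TensorFlow and not part of the utilities above. It draws relaxed samples softmax((log($\pi$) + g) / temperature) and reports the average size of the largest component; values near 1 mean the samples are close to one-hot. The helper names and the specific temperatures are assumptions chosen for this example.

```python
import math
import random


def gumbel_softmax_sample(pi, temperature, rng):
    """Relaxed categorical sample: softmax((log(pi) + g) / temperature)."""
    logits = []
    for p in pi:
        u = max(rng.random(), 1e-12)      # keep log(u) finite
        g = -math.log(-math.log(u))       # Gumbel(0, 1) noise
        logits.append((math.log(p) + g) / temperature)
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_component(pi, temperature, n_draws=2000, seed=0):
    """Average of the largest entry over many relaxed samples."""
    rng = random.Random(seed)
    draws = (gumbel_softmax_sample(pi, temperature, rng) for _ in range(n_draws))
    return sum(max(sample) for sample in draws) / n_draws


pi = [0.2, 0.5, 0.3]
low_t = mean_max_component(pi, temperature=0.1)
high_t = mean_max_component(pi, temperature=10.0)
print(low_t, high_t)  # the low-temperature value should be the larger one
```

At a high temperature the samples flatten toward the uniform vector, while at a low temperature they approach one-hot vectors at the cost of higher-variance gradients, which is the trade-off discussed in the paper.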