<a href="https://colab.research.google.com/github/probabll/dgm4nlp/blob/master/notebooks/sst/SST_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We need to import some helper code, so run the following cell first:

In [0]:
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

# Colab

On [colab](https://colab.research.google.com) you can open a notebook directly from our [github repo](https://github.com/probabll/dgm4nlp) and then run the following:

In [0]:
using_colab = True

In [5]:
if using_colab:
  !rm -fr dgm4nlp sst
  !git clone https://github.com/probabll/dgm4nlp.git
  !cp -R dgm4nlp/notebooks/sst ./  
  !ls

Cloning into 'dgm4nlp'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 43 (delta 11), reused 20 (delta 2), pack-reused 0
Unpacking objects: 100% (43/43), done.
dgm4nlp  sample_data  sst


Now we can start our lab.

In [0]:
import torch
from torch import nn
# CPU should be fine for this lab
device = torch.device('cpu')  
#device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
from collections import OrderedDict
import numpy as np

# Sentiment Classification 


We are going to augment a sentiment classifier with a layer of discrete latent variables which will help us improve the model's interpretability. But first, let's quickly review the baseline task.


In sentiment classification, we have some text input $x = \langle x_1, \ldots, x_n \rangle$, e.g. a sentence or short paragraph, which expresses a certain sentiment $y$, i.e. one of $K$ classes, towards a subject (e.g. a film or a product). 



We can learn a sentiment classifier by learning a categorical distribution over classes for a given input:

\begin{align}
Y|x &\sim \text{Cat}(f(x; \theta))
\end{align}

A categorical distribution over $K$ classes is parameterised by a $K$-dimensional probability vector; here we use a neural network $f$ to map from the input to this probability vector. Technically we say *a neural network parameterises our model*, that is, it computes the parameters of our categorical observation model. The figure below is a graphical depiction of the model: circled nodes are random variables (a shaded node is an observed variable), uncircled nodes are deterministic, and a plate indicates multiple draws.

<img src="https://github.com/probabll/dgm4nlp/raw/master/notebooks/sst/img/classifier.png"  height="100">

The neural network (NN) $f(\cdot; \theta)$ has parameters of its own, i.e. the weights of the various architecture blocks used, which we denote generically by $\theta$.

Suppose we have a dataset $\mathcal D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ containing $N$ i.i.d. observations. Then we can use the log-likelihood function 
\begin{align}
\mathcal L(\theta|\mathcal D) &= \sum_{k=1}^{N} \log P(y^{(k)}|x^{(k)}, \theta) \\
&= \sum_{k=1}^{N} \log \text{Cat}(y^{(k)}|f(x^{(k)}; \theta))
\end{align}
 to estimate $\theta$ by maximisation:
 \begin{align}
 \theta^\star = \arg\max_{\theta \in \Theta} \mathcal L(\theta|\mathcal D)
 \end{align}
 

We can use stochastic gradient-ascent to find a local optimum of $\mathcal L(\theta|\mathcal D)$, which only requires a gradient estimate:

\begin{align}
\mathcal L(\theta|\mathcal D) &= \sum_{k=1}^{|\mathcal D|} \log P(y^{(k)}|x^{(k)}, \theta) \\ 
&= \sum_{k=1}^{|\mathcal D|} \frac{1}{N} N \log P(y^{(k)}|x^{(k)}, \theta)  \\
&= \mathbb E_{\mathcal U(1/N)} \left[ N \log P(y^{(K)}|x^{(K)}, \theta) \right]  \\
&\overset{\text{MC}}{\approx} \frac{N}{M} \sum_{s=1}^M \log P(y^{(k_s)}|x^{(k_s)}, \theta) \\
&\text{where }K_s \sim \mathcal U(1/N)
\end{align}

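This is a Monte Carlo (MC) estimate of the log-likelihood computed on $M$ data points selected uniformly at random from $\mathcal D$. Because the estimate is unbiased, its gradient with respect to $\theta$ is an unbiased estimate of the gradient of $\mathcal L(\theta|\mathcal D)$.

As long as $f$ remains differentiable with respect to its inputs and parameters, we can rely on automatic differentiation to obtain gradient estimates.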
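
The identity behind the minibatch estimate can be checked numerically. The sketch below is plain Python and uses made-up per-example log-probabilities (an assumption for illustration, no model is involved); `minibatch_estimate` is a hypothetical helper name, not part of the lab code.

```python
import random

def minibatch_estimate(log_probs, indices):
    """Return (N / M) * sum of the log-probabilities at the sampled indices."""
    n, m = len(log_probs), len(indices)
    return n / m * sum(log_probs[k] for k in indices)

# Made-up values standing in for log P(y^(k) | x^(k), theta), k = 1..N
log_probs = [-0.2, -1.5, -0.7, -2.3, -0.1, -0.9]
n = len(log_probs)
full = sum(log_probs)

# Averaging the M=1 estimate over all N equally likely indices recovers the
# full log-likelihood exactly, which is what unbiasedness means.
expected = sum(minibatch_estimate(log_probs, [k]) for k in range(n)) / n

# One random minibatch of size M=3, indices drawn uniformly with replacement
rng = random.Random(0)
batch = [rng.randrange(n) for _ in range(3)]
estimate = minibatch_estimate(log_probs, batch)
print(full, expected, estimate)
```

A single `estimate` will generally differ from `full`; only its average over many minibatches matches.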

In what follows we show how to design $f$ and how to extend this basic model to a latent-variable model.



## Architecture


The function $f$ conditions on a high-dimensional discrete input (i.e. text), so we first need to convert it to continuous real-valued vectors. This is the job of an *encoder*. 

**Embedding Layer**

The first step is to convert the words in $x$ to vectors, which in this lab we will do with a pre-trained embedding layer (we will use GloVe).

We will denote the embedding of the $i$th word of the input by:

\begin{equation}
\mathbf x_i = \text{glove}(x_i)
\end{equation}

**Encoder Layer**

In this lab, an encoder takes a sequence of input vectors $\mathbf x_1^n$, each $I$-dimensional, and produces a sequence of output vectors $\mathbf t_1^n$, each $O$-dimensional and a summary vector $\mathbf h \in \mathbb R^O$:

\begin{equation}
    \mathbf t_1^n, \mathbf h = \text{encoder}(\mathbf x_1^n; \theta_{\text{enc}})
\end{equation}

where we use $\theta_{\text{enc}}$ to denote the subset of parameters in $\theta$ that are specific to this encoder block. 

*Remark:* in practice for a correct batched implementation, our encoders also take a mask matrix and a vector of lengths.

Examples of encoding functions can be a feed-forward NN (with an aggregator based on sum or average/max pooling) or a recurrent NN (e.g. an LSTM/GRU). Other architectures are also possible.
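
To see why the mask and the lengths matter, here is a minimal sketch of sum and average aggregation over one padded sequence. It uses plain Python lists instead of tensors, and `masked_sum` and `masked_avg` are hypothetical names chosen for this illustration.

```python
def masked_sum(vectors, mask):
    """Sum the vectors of one sequence, skipping positions where mask is False."""
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(vectors, mask):
        if keep:
            for j in range(dim):
                total[j] += vec[j]
    return total

def masked_avg(vectors, mask):
    """Average over the valid positions only, i.e. divide by the true length."""
    length = sum(mask)
    return [v / length for v in masked_sum(vectors, mask)]

# One sequence with T=4 positions and d=2 features; the last position is padding.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [True, True, True, False]
summary_sum = masked_sum(vectors, mask)
summary_avg = masked_avg(vectors, mask)
print(summary_sum, summary_avg)
```

Without the mask the padding vector would leak into the summary, and without the length the average would be taken over the padded size rather than the sentence length.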

**Output Layer**

From our summary vector $\mathbf h$, we need to parameterise a categorical distribution over $K$ classes, thus we use

\begin{align}
f(x; \theta) &= \text{softmax}(\text{dense}_K(\mathbf h; \theta_{\text{output}}))
\end{align}

where $\text{dense}_K$ is a dense layer with $K=5$ outputs and $\theta_{\text{output}}$ corresponds to its parameters (weight matrix and bias vector). Note that we need to use the softmax activation function in order to guarantee that the output of $f$ is a normalised probability vector.
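
As a small illustration of the output layer, the sketch below applies an affine map with $K=5$ outputs to a summary vector and normalises the scores with a softmax. The weights and the summary vector are made up, and the helper names `dense` and `softmax` are assumptions for this example rather than the lab's code.

```python
import math

def dense(h, weights, bias):
    """Affine map: one output score per row of `weights`."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

h = [0.5, -1.0, 2.0]                      # made-up summary vector
weights = [[0.1 * (k + 1) * (j - 1) for j in range(3)] for k in range(5)]
bias = [0.0] * 5
probs = softmax(dense(h, weights, bias))  # K=5 class probabilities
print(probs)
```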


## Implementation

We provide a few encoders which implement the following abstract Encoder class:

In [0]:
class Encoder(nn.Module):
    """
    For you to focus on DGMs and abstract away from certain architecture details, 
     we will be providing some helper classes.
     
    An encoder is one of them.
    """
    
    
    def __init__(self):
        super(Encoder, self).__init__()
        
    def forward(self, inputs, mask, lengths):
        """
        The inputs are batch-first tensors
        
        :param inputs: [B, T, d]
        :param mask: [B, T]
        :param lengths: [B]
        :returns: [B, T, d], [B, d]
            where the first tensor is the transformed input
            and the second tensor is a summary of all inputs
        """
        pass
        

We will mostly use a bag-of-words encoder (to keep everything lightweight), but we also provide a feed-forward and an LSTM encoder for you:

In [0]:
class Passthrough(Encoder):
    """
    This encoder does not transform its inputs: it simply passes them forward and summarises 
        them via a sum.
    """
    
    def __init__(self):
        super(Passthrough, self).__init__()
        
    def forward(self, inputs, mask, lengths, **kwargs):
        # inputs: [B, T, d]
        # mask: [B, T]
        # lengths: [B]
        
        # [B, T, d], [B, d]
        return inputs, (inputs * mask.unsqueeze(-1).float()).sum(dim=1) 

    
class FFEncoder(Encoder):
    """
    A typical feed-forward NN with tanh hidden activations.
    """
    
    def __init__(self, input_size, output_size, 
                 activation=None, 
                 hidden_sizes=[], 
                 aggregator='sum',
                 dropout=0.5):
        """
        :param input_size: int
        :param output_size: int
        :param hidden_sizes: list of integers (dimensionality of hidden layers)
        :param aggregator: 'sum' or 'avg'
        :param dropout: dropout rate
        """
        super(FFEncoder, self).__init__()
        layers = []
        if hidden_sizes:
            for i, size in enumerate(hidden_sizes):
                if dropout > 0.:
                    layers.append(('dropout%d' % i, nn.Dropout(p=dropout)))
                layers.append(('linear%d' % i, nn.Linear(input_size, size)))
                layers.append(('tanh%d' % i, nn.Tanh()))
                input_size = size
        if dropout > 0.:
            layers.append(('dropout', nn.Dropout(p=dropout)))
        layers.append(('linear', nn.Linear(input_size, output_size)))
        self.layer = nn.Sequential(OrderedDict(layers))
        self.activation = activation
        if aggregator not in ['sum', 'avg']:
            raise ValueError("I can only aggregate outputs using 'sum' or 'avg'")
        self.aggregator = aggregator
        
    def forward(self, x, mask, lengths):
        # [B, T, d]
        y = self.layer(x)
        if self.activation is not None:
            y = self.activation(y)
        # [B, d]
        s = (y * mask.unsqueeze(-1).float()).sum(dim=1)
        if self.aggregator == 'avg':
            s /= lengths.unsqueeze(-1).float()
        return y, s


from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class LSTMEncoder(Encoder):
    """
    This module encodes a sequence into a single vector using an LSTM,
     it also returns the hidden states at each time step.
    """

    def __init__(self, in_features, hidden_size: int = 200,
                 batch_first: bool = True,
                 bidirectional: bool = True):
        """
        :param in_features:
        :param hidden_size:
        :param batch_first:
        :param bidirectional:
        """
        super(LSTMEncoder, self).__init__()
        self.lstm = nn.LSTM(in_features, hidden_size, batch_first=batch_first,
                            bidirectional=bidirectional)

    def forward(self, x, mask, lengths):
        """
        Encode sentence x
        :param x: sequence of word embeddings, shape [B, T, E]
        :param mask: byte mask that is 0 for invalid positions, shape [B, T]
        :param lengths: the lengths of each input sequence [B]
        :return:
        """

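        # lengths must live on the CPU; enforce_sorted=False lifts the requirement
        # that the batch be sorted by decreasing length
        packed_sequence = pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        outputs, (hx, cx) = self.lstm(packed_sequence)
        # total_length keeps the time dimension equal to that of the input
        outputs, _ = pad_packed_sequence(
            outputs, batch_first=True, total_length=x.size(1))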

        # classify from concatenation of final states
        if self.lstm.bidirectional:
            final = torch.cat([hx[-2], hx[-1]], dim=-1)
        else:  # classify from final state
            final = hx[-1]

        return outputs, final
    
    
def get_encoder(layer, in_features, hidden_size, bidirectional=True):
    """Returns the requested layer."""

    # TODO: make pass and average layers
    if layer == "pass":
        return Passthrough()
    elif layer == 'ff':
        return FFEncoder(in_features, 2 * hidden_size, hidden_sizes=[hidden_size], aggregator='sum')
    elif layer == "lstm":
        return LSTMEncoder(in_features, hidden_size,
                           bidirectional=bidirectional)
    else:
        raise ValueError("Unknown layer")

# Sentiment Classification with Latent Rationale

A latent rationale is a compact and informative fragment of the input on which a NN classifier bases its decisions. [Lei et al. (2016)](http://aclweb.org/anthology/D16-1011) proposed to induce such rationales along with a regression model for multi-aspect sentiment analysis; their model is trained via REINFORCE on a dataset of beer reviews.

*Remark:* the model we will develop here can be seen as a probabilistic version of their model. The rest of this notebook focuses on our own probabilistic view of the model.

The picture below depicts our latent-variable model for rationale extraction:

<img src="https://github.com/probabll/dgm4nlp/raw/master/notebooks/sst/img/rationale.png"  height="200">

where we augment the model with a collection of latent variables $z = \langle z_1, \ldots, z_n\rangle$ where $z_i$ is a binary latent variable. Each latent variable $z_i$ regulates whether or not the input $x_i$ is available to the classifier.  We use $x \odot z$ to denote the selected words, which, in the terminology of Lei et al, is a latent rationale.

Again the classifier parameterises a Categorical distribution over $K=5$ outcomes, though this time it can encode only a selection of the input:

\begin{align}
    Z_i & \sim \text{Bern}(p_1) \\
    Y|z,x &\sim \text{Cat}(f(x \odot z; \theta))
\end{align}

where we have a shared and fixed Bernoulli prior (with parameter $p_1$) for all $n$ latent variables.

Here is an example design for $f$:

\begin{align}
\mathbf x_i &= z_i \, \text{glove}(x_i) \\
\mathbf t_1^n, \mathbf h &= \text{encoder}(\mathbf x_1^n; \theta_{\text{enc}}) \\
f(x \odot z; \theta) &= \text{softmax}(\text{dense}_K(\mathbf h; \theta_{\text{output}}))
\end{align}

where:
* $z_i$ either leaves $\mathbf x_i$ unchanged or turns it into a vector of zeros;
* the encoder only sees features from selected inputs, i.e. $x_i$ for which $z_i = 1$;
* $\text{dense}_K$ is a linear layer with $K=5$ outputs.
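
The effect of the selectors can be illustrated with a toy bag-of-words summary. This is a sketch in plain Python with made-up 2-dimensional "embeddings"; `select` and `summarise` are hypothetical helper names.

```python
def select(embeddings, z):
    """Multiply each word vector by its selector z_i (0 or 1)."""
    return [[z_i * value for value in vec] for vec, z_i in zip(embeddings, z)]

def summarise(vectors):
    """Bag-of-words summary: sum over positions."""
    return [sum(column) for column in zip(*vectors)]

embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
z = [1, 0, 1]                       # the second word is not selected
h = summarise(select(embeddings, z))

# Changing an unselected word leaves the summary untouched, so the
# classifier cannot base its decision on it.
changed = [[1.0, 0.0], [7.0, -7.0], [3.0, 3.0]]
h_changed = summarise(select(changed, z))
print(h, h_changed)
```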



## Prior


Our prior is a Bernoulli with fixed parameter $0 < p_1 < 1$:

\begin{align}
Z_i & \sim \text{Bern}(p_1)
\end{align}

As we will be using Bernoulli priors and posteriors, it is a good idea to implement a Bernoulli class:

In [0]:
from torch.distributions import Bernoulli as PyTBernoulli

class Bernoulli:
    """
    This class encapsulates a collection of Bernoulli distributions. 
    Each Bernoulli is uniquely specified by p_1, where
        Bernoulli(X=x|p_1) = pow(p_1, x) * pow(1 - p_1, 1 - x)
    is the Bernoulli probability mass function (pmf).    
    """
    
    def __init__(self, logits=None, probs=None):
        """
        We can specify a Bernoulli distribution via a logit or a probability. 
         You need to specify at least one, and if you specify both, beware that
         in this implementation logits will be used.
         
        Recall that: probs = sigmoid(logits).
         
        :param logits: a tensor of logits (a logit is defined as log (p_1/p_0))
            where p_0 = 1 - p_1
        :param probs: a tensor of probabilities, each in (0, 1)
        
        """        
        if probs is None and logits is None:
            raise ValueError('I need probabilities or logits')        
        if logits is None:            
            self.probs = probs
        else:
            #self._bernoulli = PyTBernoulli(logits=logits)
            self.probs = torch.sigmoid(logits)
    
    def sample(self):
        """Returns a sample with the same shape as the parameters"""
        #return self._bernoulli.sample([1]).squeeze(0)
        #print('<0', (self.probs < 0).sum(), '>1', (self.probs > 1).sum())
        return torch.bernoulli(self.probs)
    
    def log_pmf(self, x):
        """
        Assess the log probability of a sample. 
        :param x: either a single sample (0 or 1) or a tensor of samples with the same shape as the parameters.
        :returns: tensor with log probabilities with the same shape as parameters
            (if the input is a single sample we broadcast it to the shape of the parameters)
        """
        # x * torch.log(self.probs) + (1 - x) * torch.log(1. - self.probs)
        return torch.where(x == 1., torch.log(self.probs), torch.log(1. - self.probs))
    
    def kl(self, other: 'Bernoulli'):
        """
        Compute the KL divergence between two Bernoulli distributions (from self to other).
        
        :return: KL[self||other] with same shape parameters
        """
        p1 = self.probs
        p0 = 1. - self.probs
        q1 = other.probs
        q0 = 1. - other.probs        
        return p1 * (torch.log(p1) - torch.log(q1)) + p0 * (torch.log(p0) - torch.log(q0))
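
As a sanity check on the formulas used by the class above, here is a scalar sketch in plain Python. It compares the closed-form KL with the KL computed directly from its definition as an expectation; the function names are hypothetical and the parameter values are made up.

```python
import math

def bernoulli_pmf(x, p1):
    """Bern(x | p1) = p1^x * (1 - p1)^(1 - x) for x in {0, 1}."""
    return (p1 ** x) * ((1.0 - p1) ** (1 - x))

def bernoulli_kl(q1, p1):
    """Closed-form KL(Bern(q1) || Bern(p1)), both parameters strictly in (0, 1)."""
    return (q1 * (math.log(q1) - math.log(p1))
            + (1.0 - q1) * (math.log(1.0 - q1) - math.log(1.0 - p1)))

def bernoulli_kl_by_definition(q1, p1):
    """The same quantity as an explicit expectation over x in {0, 1}."""
    return sum(bernoulli_pmf(x, q1)
               * math.log(bernoulli_pmf(x, q1) / bernoulli_pmf(x, p1))
               for x in (0, 1))

kl_same = bernoulli_kl(0.3, 0.3)   # a distribution against itself
kl_diff = bernoulli_kl(0.5, 0.3)   # posterior 0.5 against prior 0.3
print(kl_same, kl_diff, bernoulli_kl_by_definition(0.5, 0.3))
```

Note that the KL divergence is not symmetric, so the order of the arguments matters.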


## Classifier

The classifier encodes only a selection of the input, which we denote $x \odot z$, and parameterises a Categorical distribution over $5$ outcomes (sentiment levels).

Thus let's implement a Categorical distribution (we will only need to be able to assess its log pmf):

In [0]:
class Categorical:
    
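    def __init__(self, log_probs):
        # [B, K]: class log-probabilities
        self.log_probs = log_probs
        
    def log_pmf(self, y):
        """
        :param y: [B] integers
        :returns: [B] log-probability of each label
        """
        # gather yields [B, 1]; squeeze so the result broadcasts correctly against other [B] tensors
        return torch.gather(self.log_probs, 1, y.unsqueeze(-1)).squeeze(-1)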

and a classifier architecture:

In [0]:
class Classifier(nn.Module):
    """
    The classifier takes an input text (and rationale z) and computes p(y|x,z)
    """

    def __init__(self,
                 embed:        nn.Embedding = None,
                 hidden_size:  int = 200,
                 output_size:  int = 1,
                 dropout:      float = 0.1,
                 layer:        str = "pass",
                 ):

        super(Classifier, self).__init__()

        emb_size = embed.weight.shape[1]
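        # the 'pass' encoder returns emb_size features, the others 2 * hidden_size
        enc_size = emb_size if layer == "pass" else hidden_size * 2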
        self.embed_layer = nn.Sequential(
            embed,
            nn.Dropout(p=dropout)
        )

        self.enc_layer = get_encoder(layer, emb_size, hidden_size)

        self.output_layer = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(enc_size, output_size),
            nn.LogSoftmax(dim=-1)
        )

        self.report_params()

    def report_params(self):
        count = 0
        for name, p in self.named_parameters():
            if p.requires_grad and "embed" not in name:
                count += np.prod(list(p.shape))
        print("{} #params: {}".format(self.__class__.__name__, count))

    def forward(self, x, mask, z) -> Categorical:

        rnn_mask = mask
        emb = self.embed_layer(x)

        # [B, T]
        rnn_mask = z > 0.
        # [B, T, 1]
        z_mask = z.unsqueeze(-1).float()
        # [B, T, E]
        emb = emb * z_mask

        lengths = mask.long().sum(1)

        # encode the sentence
        _, final = self.enc_layer(emb, rnn_mask, lengths)

        # predict sentiment from final state(s)
        log_probs = self.output_layer(final)        
        return Categorical(log_probs)

## Inference


Computing the log-likelihood of an observation requires marginalising over assignments of $z$:

\begin{align}
P(y|x,\theta,p_1) &= \sum_{z_1 = 0}^1 \cdots \sum_{z_n=0}^1 P(z|p_1)\times P(y|x,z, \theta) \\
&= \sum_{z_1 = 0}^1 \cdots \sum_{z_n=0}^1 \left( \prod_{i=1}^n \text{Bern}(z_i|p_1)\right) \times \text{Cat}(y|f(x \odot z; \theta)) 
\end{align}

This is clearly intractable: there are $2^n$ possible assignments to $z$ and because the classifier conditions on all latent selectors, there's no way to simplify the expression.
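
For a very short input we can still carry out the sum by brute force, which makes the $2^n$ growth concrete. The sketch below is plain Python; `toy_classifier` is a made-up stand-in for the classifier network (a softmax over summed per-word scores) and is only an assumption for this illustration.

```python
import itertools
import math

def prior(z, p1):
    """Product of independent Bern(z_i | p1)."""
    result = 1.0
    for z_i in z:
        result *= p1 if z_i == 1 else 1.0 - p1
    return result

def toy_classifier(x, z):
    """Stand-in for Cat(y | f(x * z)): softmax of the summed selected word scores."""
    num_classes = len(x[0])
    scores = [sum(z_i * word[k] for word, z_i in zip(x, z)) for k in range(num_classes)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal(x, p1):
    """P(y | x) for every class y, summing over all 2^n assignments of z."""
    num_classes = len(x[0])
    result = [0.0] * num_classes
    num_terms = 0
    for z in itertools.product([0, 1], repeat=len(x)):
        weight = prior(z, p1)
        probs = toy_classifier(x, z)
        for k in range(num_classes):
            result[k] += weight * probs[k]
        num_terms += 1
    return result, num_terms

# n=4 words, each with made-up scores for 3 classes
x = [[0.5, -0.5, 0.0], [1.0, 0.0, -1.0], [-0.2, 0.3, 0.1], [0.0, 0.0, 2.0]]
p_y, num_terms = marginal(x, p1=0.3)
print(num_terms, p_y)
```

Every additional word doubles the number of classifier evaluations, so this is not an option for real sentences.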

We will avoid computing this intractable marginal by instead employing an independently parameterised inference model.
This inference model $Q(z|x, y, \lambda)$ is an approximation to the true posterior $P(z|x, y, \theta, p_1)$, and we use $\lambda$ to denote its parameters.


We make a *mean field* assumption, whereby we model latent variables independently given the input:
\begin{align}
Q(z|x, y, \lambda) 
    &= \prod_{i=1}^{n} Q(z_i|x; \lambda) \\
    &= \prod_{i=1}^{n} \text{Bern}(z_i|g_i(x; \lambda)) 
\end{align}

where $g(x; \lambda)$ is a NN that maps from $x = \langle x_1, \ldots, x_n\rangle$ to $n$ Bernoulli parameters, each of which is a probability value (thus $0 < g_i(x; \lambda) < 1$).

Note that though we could condition on $y$ for approximate posterior inference, we are opportunistically leaving it out. This way, $Q$ is directly available at test time for making predictions. The figure below is a graphical depiction of the inference model (we show a dashed arrow from $y$ to $z$ to remind you that in principle the label is also available).

<img src="https://github.com/probabll/dgm4nlp/raw/master/notebooks/sst/img/inference.png"  height="200">

Here is an example design for $g$:
\begin{align}
\mathbf x_i &= \text{glove}(x_i) \\
\mathbf t_1^n, \mathbf h &= \text{encoder}(\mathbf x_1^n; \lambda_{\text{enc}}) \\
g_i(x; \lambda) &= \sigma(\text{dense}_1(\mathbf t_i; \lambda_{\text{output}}))
\end{align}
where
* $\text{glove}$ is a pre-trained embedding function;
* $\text{dense}_1$ is a dense layer with a single output;
* and $\sigma(\cdot)$ is the sigmoid function, necessary to parameterise a Bernoulli distribution.
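
Under the mean field assumption, the log-probability of a whole assignment $z$ is just a sum of per-position Bernoulli log-probabilities. The following plain-Python sketch illustrates this with made-up logits standing in for the outputs of $\text{dense}_1$; `log_q` is a hypothetical helper name.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_q(z, logits):
    """log Q(z | x) = sum_i log Bern(z_i | sigmoid(logit_i))."""
    total = 0.0
    for z_i, a in zip(z, logits):
        p = sigmoid(a)
        total += math.log(p) if z_i == 1 else math.log(1.0 - p)
    return total

logits = [2.0, -1.0, 0.0]   # made-up scores for a 3-word input
z = (1, 0, 1)
value = log_q(z, logits)

# The factorised distribution is normalised over all 2^3 assignments
total_mass = sum(math.exp(log_q(a, logits))
                 for a in itertools.product([0, 1], repeat=3))
print(value, total_mass)
```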

Here we implement this product of Bernoulli distributions:

In [0]:
class ProductOfBernoullis(nn.Module):
    """
    This is an inference network that parameterises independent Bernoulli distributions.
    """

    def __init__(self,
                 embed:       nn.Embedding,
                 hidden_size: int = 200,
                 dropout:     float = 0.1,
                 layer:       str = "lstm"
                 ):

        super(ProductOfBernoullis, self).__init__()

        emb_size = embed.weight.shape[1]
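        # the 'pass' encoder returns emb_size features, the others 2 * hidden_size
        enc_size = emb_size if layer == "pass" else hidden_size * 2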

        self.embed_layer = nn.Sequential(embed)
        self.enc_layer = get_encoder(layer, emb_size, hidden_size)
        self.logit_layer = nn.Linear(enc_size, 1, bias=True)
        
        self.report_params()

    def report_params(self):
        count = 0
        for name, p in self.named_parameters():
            if p.requires_grad and "embed" not in name:
                count += np.prod(list(p.shape))
        print("{} #params: {}".format(self.__class__.__name__, count))

    def forward(self, x, mask) -> Bernoulli:
        """
        It takes a tensor of tokens (integers)
         and predicts a Bernoulli distribution for each position.
        
        :param x: [B, T]
        :param mask: [B, T]
        :returns: Bernoulli
        """

        # encode sentence
        # [B]
        lengths = mask.long().sum(1)
        # [B, T, E]
        emb = self.embed_layer(x)  
        # [B, T, d]
        h, _ = self.enc_layer(emb, mask, lengths)

        # compute parameters for Bernoulli q(z|x)
        # [B, T, 1] Bernoulli distributions
        logits = self.logit_layer(h)
        # [B, T]
        logits = logits.squeeze(-1)
        return Bernoulli(logits=logits)

## Parameter Estimation

In variational inference, our objective is to maximise the *evidence lower bound* (ELBO):

\begin{align}
\log P(y|x) &\ge \mathbb E_{Q(z|x, y, \lambda)}\left[ \log P(y|x, z, \theta, p_1) \right] - \text{KL}(Q(z|x, y, \lambda) || P(z|p_1)) \\
&\overset{\text{MF}}{=}\mathbb E_{Q(z|x, y, \lambda)}\left[ \log P(y|x, z, \theta, p_1) \right] - \sum_{i=1}^n \text{KL}(Q(z_i|x, \lambda) || P(z_i|p_1)) 
\end{align}

where the *mean field* assumption we made implies that the KL term is simply a sum of KL divergences from a Bernoulli posterior to a Bernoulli prior.

Note that the ELBO remains intractable: solving the expectation in closed form still requires $2^n$ evaluations of the classifier network. Unlike the true posterior, however, $Q(z|x,y,\lambda)$ is known and can be sampled from, so we can obtain gradient estimates based on samples.
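
For a tiny example we can evaluate both sides of the bound exactly and confirm that the ELBO never exceeds the log marginal. The sketch below is plain Python with $n=3$; the `likelihood` function is a made-up table standing in for $P(y|x,z,\theta)$ at one fixed $(x, y)$, and the posterior parameters are arbitrary.

```python
import itertools
import math

def prob(z, params):
    """Probability of z under independent Bernoullis with the given parameters."""
    result = 1.0
    for z_i, p in zip(z, params):
        result *= p if z_i == 1 else 1.0 - p
    return result

def likelihood(z):
    """Made-up P(y | x, z) for one fixed (x, y); stands in for the classifier."""
    return 0.1 + 0.25 * z[0] + 0.05 * z[1] + 0.3 * z[2]

prior_params = [0.3, 0.3, 0.3]   # shared prior parameter p_1
q_params = [0.8, 0.4, 0.9]       # made-up values of g_i(x; lambda)
assignments = list(itertools.product([0, 1], repeat=3))

# Exact log marginal: sum over all 2^n assignments under the prior
log_marginal = math.log(sum(prob(z, prior_params) * likelihood(z) for z in assignments))

# Exact ELBO: expected log-likelihood under Q minus the sum of per-position KLs
expected_ll = sum(prob(z, q_params) * math.log(likelihood(z)) for z in assignments)
kl = sum(q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
         for q, p in zip(q_params, prior_params))
elbo = expected_ll - kl
print(elbo, log_marginal)
```

The gap between the two numbers is the KL divergence from $Q$ to the true posterior, which is why a better inference model gives a tighter bound.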

### Gradient of the classifier network

For the classifier, we encounter no problem:

\begin{align}
&\nabla_\theta\sum_{z} Q(z|x, \lambda)\log P(y|x,z,\theta) \\
&=\sum_{z} Q(z|x, \lambda)\nabla_\theta\log P(y|x,z,\theta) \\
&= \mathbb E_{Q(z|x, \lambda)}\left[\nabla_\theta\log P(y|x,z,\theta) \right] \\
&\overset{\text{MC}}{\approx} \frac{1}{J} \sum_{j=1}^J \nabla_\theta \log P(y|x, z^{(j)}, \theta) 
\end{align}
where $z^{(j)} \sim Q(z|x,\lambda)$.


### Gradient of the inference network

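For the inference model, the KL term is available in closed form and can be differentiated directly. For the expected log-likelihood, however, we have to use the *score function estimator* (a.k.a. REINFORCE), because the distribution under which we take the expectation itself depends on $\lambda$: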

\begin{align}
&\nabla_\lambda\sum_{z} Q(z|x, \lambda)\log P(y|x,z,\theta)\\
&=\sum_{z} \nabla_\lambda Q(z|x, \lambda)\log P(y|x,z,\theta) \\
&=\sum_{z}  \underbrace{Q(z|x, \lambda) \nabla_\lambda \log Q(z|x, \lambda)}_{\nabla_\lambda Q(z|x, \lambda)} \log P(y|x,z,\theta) \\
&= \mathbb E_{Q(z|x, \lambda)}\left[ \log P(y|x,z,\theta) \nabla_\lambda \log Q(z|x, \lambda) \right] \\
&\overset{\text{MC}}{\approx} \frac{1}{J} \sum_{j=1}^J  \log P(y|x, z^{(j)}, \theta) \nabla_\lambda \log Q(z^{(j)}|x, \lambda) 
\end{align}

where $z^{(j)} \sim Q(z|x,\lambda)$.
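
The score function identity can be verified exactly for a single Bernoulli variable whose parameter is the sigmoid of a logit. The sketch below is plain Python, with made-up values for $\log P(y|x,z,\theta)$; it also shows that subtracting a constant baseline from the reward leaves the expected gradient unchanged. All function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def score(z, logit):
    """d/d logit of log Bern(z | sigmoid(logit)), which equals z - p."""
    return z - sigmoid(logit)

def expected_sf_gradient(f, logit, baseline=0.0):
    """Exact expectation of (f(z) - baseline) * score(z) over z in {0, 1}."""
    p = sigmoid(logit)
    return sum(q * (f(z) - baseline) * score(z, logit)
               for z, q in ((1, p), (0, 1.0 - p)))

def true_gradient(f, logit):
    """d/d logit of E[f(z)] = p * f(1) + (1 - p) * f(0)."""
    p = sigmoid(logit)
    return p * (1.0 - p) * (f(1) - f(0))

def f(z):
    """Made-up log-likelihood of the label given the selector."""
    return -0.3 if z == 1 else -1.2

logit = 0.4
print(true_gradient(f, logit),
      expected_sf_gradient(f, logit),
      expected_sf_gradient(f, logit, baseline=-0.7))
```

A baseline does not change the expectation, but it can reduce the variance of the sampled estimate considerably.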

## Implementation

Let's implement the model and the loss (negative ELBO):

In [0]:
from torch.nn.functional import softplus
#from discrete.util import get_z_stats


class Model(nn.Module):
    """
    Reimplementation of Lei et al. (2016). Rationalizing Neural Predictions
    for Stanford Sentiment.
    (Does classification instead of regression.)

    Consists of:
    - Encoder that computes p(y | x, z)
    - Generator that computes p(z | x) independently or dependently with an RNN.
    """

    def __init__(self,
                 vocab:       object = None,
                 vocab_size:  int = 0,
                 emb_size:    int = 200,
                 hidden_size: int = 200,
                 output_size: int = 1,
                 prior_p1:    float = 0.1,
                 dropout:     float = 0.1,
                 layer_cls:   str = 'pass',
                 layer_inf:   str = 'lstm',
                 ):

        super(Model, self).__init__()

        self.vocab = vocab
        self.embed = embed = nn.Embedding(vocab_size, emb_size, padding_idx=1)

        # TODO: rename to obs_model
        self.cls_net = Classifier(
            embed=embed, hidden_size=hidden_size, output_size=output_size,
            dropout=dropout, layer=layer_cls)
        
        # TODO: rename to q_z
        self.inference_net = ProductOfBernoullis(
            embed=embed, hidden_size=hidden_size,
            dropout=dropout, layer=layer_inf)
        
        self.prior_p1 = prior_p1

    def predict(self, py, **kwargs):
        """
        Predict deterministically.
        :param py: Categorical distribution over classes, as returned by forward
        :return: [B] predicted class indices
        """
        assert not self.training, "should be in eval mode for prediction"
        return py.log_probs.argmax(-1)

    def forward(self, x):
        """
        Generate a sequence of zs with the Generator.
        Then predict with sentence x (zeroed out with z) using Encoder.

        :param x: [B, T] (that is, batch-major is assumed)
        :return:
        """
        mask = (x != 1)  # [B,T]

        qz = self.inference_net(x, mask)

        if self.training:  # sample
            # [B, T]
            z = qz.sample()
        else:  # deterministic
            # [B, T]
            # TODO: consider this
            z = (qz.probs >= 0.5).float()
            #z = qz.sample()
            
        z = torch.where(mask, z, torch.zeros_like(z))
        
        py = self.cls_net(x, mask, z)
        return py, qz, z

    def get_loss(self, py, targets, 
                 q_z: Bernoulli, 
                 z, 
                 mask=None,
                 iter_i=0, 
                 kl_weight=1.0,
                 min_kl=0.0,
                 ll_mean=0.,
                 ll_std=1.,
                 **kwargs):
        """
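        This computes the loss for the whole model.
        We stick to the variable names of the original code as much as
        possible.

        :param py: Categorical over classes returned by the classifier
        :param targets: [B] gold labels
        :param q_z: Bernoulli approximate posterior with parameters of shape [B, T]
        :param z: [B, T] sample from q_z
        :param mask: [B, T] True for valid (non-padding) positions
        :param iter_i: iteration counter (unused here)
        :param kl_weight: multiplier on the KL term (for annealing)
        :param min_kl: free-bits threshold on the per-sentence KL
        :param ll_mean: baseline subtracted from the log-likelihood reward
        :param ll_std: scale dividing the centred reward
        :param kwargs:
        :return: loss (scalar tensor), dict with statistics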
        """
        assert mask is not None, "provide mask"

        lengths = mask.sum(1).float()
        batch_size = mask.size(0)
        terms = OrderedDict()

        # shape: [B]
        # log p(y|x,z) where z ~ q
        #one_hot_target = (targets.unsqueeze(-1) == torch.arange(5, device=device).reshape(1, 5)).float()            
        #ll = torch.sum(py.log_probs * one_hot_target, dim=-1)
        # [B]
        ll = py.log_pmf(targets)
        
        # KL(q||p)
        # [B, T]
        # The prior is a fixed Bernoulli(p_1), as in the model definition above.
        # (Alternative kept from experiments: prior_p1 = np.random.beta(0.5, 0.5))
        prior_p1 = self.prior_p1
        p_z = Bernoulli(probs=torch.full_like(q_z.probs, prior_p1))
        kl = q_z.kl(p_z)
        kl = torch.where(mask, kl, torch.zeros_like(kl))
                
        # Compute the log density of the sample
        # [B, T]
        log_q_z = q_z.log_pmf(z)
        log_q_z = torch.where(mask, log_q_z, torch.zeros_like(log_q_z))
        # We have independent Bernoullis, thus we just sum their log probabilities
        # [B]
        log_q_z = log_q_z.sum(1)
        
        # surrogate objective for score function estimator
        # [B]
        reward = (ll.detach() - torch.full_like(ll, ll_mean)) / torch.full_like(ll, ll_std)
        sf_surrogate = (reward * log_q_z)

        # Make terms in the ELBO
        # []
        ll = ll.mean()
        sf_surrogate = sf_surrogate.mean()
        # KL may require annealing and free-bits
        # [B]
        kl = kl.sum(dim=-1)
        kl_fb = torch.max(torch.full_like(kl, min_kl), kl)
        # []
        kl = kl.mean() 
        kl_fb = kl_fb.mean() 
        kl_fb = kl_fb * kl_weight
        
        terms['elbo'] = (ll - kl_fb).item()
        terms['ll'] = ll.item()
        terms['kl_fb'] = kl_fb.item()
        terms['kl'] = kl.item()
        terms['kl_weight'] = kl_weight
        terms['sf'] = sf_surrogate.item()
        terms['reward'] = reward.mean().item()
        terms['ll_mean'] = ll_mean
        terms['ll_std'] = ll_std
        terms['selected'] = (z.sum(1) / lengths).mean().item()
        terms['prior_p1'] = prior_p1
        terms['avg_p1'] = (torch.where(mask, q_z.probs, torch.zeros_like(q_z.probs)).sum() / mask.sum().float()).item()
        # TODO log min and max p1 in batch (mask properly)
        return - ll - sf_surrogate + kl_fb, terms

In [0]:
from collections import deque

class MovingStats:
    
    def __init__(self, memory=-1):
        self.data = deque([])
        self.memory = memory
        
    def append(self, value):
        if self.memory != 0:
            if self.memory > 0 and len(self.data) == self.memory:
                self.data.popleft()
            self.data.append(value)
        
    def mean(self):
        if len(self.data):
            return np.mean([x for x in self.data])
        else:
            return 0.
    
    def std(self):
        return 1.  # np.std(self.data) if len(self.data) > 1 else 1.
            

# Training loop

In [0]:
from sst.util import make_kv_string, get_minibatch, prepare_minibatch, print_parameters

In [19]:
from sst.sstutil import examplereader, Vocabulary, load_glove
from collections import OrderedDict
import torch.optim
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
import time
from sst.evaluate import evaluate


cfg = dict()

# Data
cfg['training_path'] = "sst/data/sst/train.txt"
cfg['dev_path'] = "sst/data/sst/dev.txt"
cfg['test_path'] = "sst/data/sst/test.txt"
cfg['word_vectors'] = 'sst/data/sst/glove.840B.300d.filtered.txt'
# Architecture
cfg['num_iterations'] = -20  # use negative for epochs and positive for iterations
cfg['print_every'] = 100
cfg['eval_every'] = -1
cfg['batch_size'] = 25
cfg['eval_batch_size'] = 25
cfg['subphrases'] = False
cfg['min_phrase_length'] = 2
cfg['lowercase'] = True
cfg['fix_emb'] = True
cfg['embed_size'] = 300
cfg['hidden_size'] = 150
cfg['num_layers'] = 1
cfg['dropout'] = 0.5
cfg['layer_inf'] = 'pass'
cfg['layer_cls'] = 'pass'
cfg['save_path'] = 'data/results'
cfg['baseline_memory'] = 1000
cfg['prior_p1'] = 0.3
cfg['min_kl'] = 0.  # use more than 0 to enable free bits
cfg['kl_weight'] = 1.  # start from zero to enable annealing
cfg['kl_inc'] = 0.00001  
# Optimiser
cfg['lr'] = 0.0002
cfg['weight_decay'] = 1e-5
cfg['lr_decay'] = 0.5
cfg['patience'] = 5
cfg['cooldown'] = 5
cfg['threshold'] = 1e-4
cfg['min_lr'] = 1e-5
cfg['max_grad_norm'] = 5.


print('# Configuration')
for k, v in cfg.items():
    print("{:20} : {:10}".format(k, v))
    
# Let's load the data into memory.
print("Loading data")
train_data = list(examplereader(
    cfg['training_path'],
    lower=cfg['lowercase'], 
    subphrases=cfg['subphrases'],
    min_length=cfg['min_phrase_length']))
dev_data = list(examplereader(cfg['dev_path'], lower=cfg['lowercase']))
test_data = list(examplereader(cfg['test_path'], lower=cfg['lowercase']))

print("train", len(train_data))
print("dev", len(dev_data))
print("test", len(test_data))

iters_per_epoch = len(train_data) // cfg["batch_size"]

if cfg["eval_every"] == -1:
    eval_every = iters_per_epoch
    print("Set eval_every to {}".format(iters_per_epoch))
else:
    eval_every = cfg["eval_every"]

if cfg["num_iterations"] < 0:
    num_iterations = iters_per_epoch * -1 * cfg["num_iterations"]
    print("Set num_iterations to {}".format(num_iterations))
else:
    num_iterations = cfg["num_iterations"]

print('\n# Example')
example = dev_data[0]
print("First dev example:", example)
print("First dev example tokens:", example.tokens)
print("First dev example label:", example.label)



# Configuration
training_path        : sst/data/sst/train.txt
dev_path             : sst/data/sst/dev.txt
test_path            : sst/data/sst/test.txt
word_vectors         : sst/data/sst/glove.840B.300d.filtered.txt
num_iterations       :        -20
print_every          :        100
eval_every           :         -1
batch_size           :         25
eval_batch_size      :         25
subphrases           :          0
min_phrase_length    :          2
lowercase            :          1
fix_emb              :          1
embed_size           :        300
hidden_size          :        150
num_layers           :          1
dropout              :        0.5
layer_inf            : pass      
layer_cls            : pass      
save_path            : data/results
baseline_memory      :       1000
prior_p1             :        0.3
min_kl               :        0.0
kl_weight            :        1.0
kl_inc               :      1e-05
lr                   :     0.0002
weight_decay         :      1e-05
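
The sign convention used for `num_iterations` and `eval_every` in the configuration cell above can be sketched as a small helper. `resolve_schedule` is a hypothetical name introduced only for this illustration, and the dataset size of 1000 is made up to keep the arithmetic easy to follow.

```python
def resolve_schedule(num_examples, batch_size, num_iterations, eval_every):
    """Convert the config's sign conventions into absolute iteration counts."""
    iters_per_epoch = num_examples // batch_size
    # a negative num_iterations means "this many epochs"
    if num_iterations < 0:
        num_iterations = iters_per_epoch * -num_iterations
    # eval_every == -1 means "once per epoch"
    if eval_every == -1:
        eval_every = iters_per_epoch
    return iters_per_epoch, num_iterations, eval_every


# Illustrative numbers only: 1000 is not the size of the SST training set.
schedule = resolve_schedule(num_examples=1000, batch_size=25,
                            num_iterations=-20, eval_every=-1)
print(schedule)  # (40, 800, 40)
```

Positive values pass through unchanged, so `resolve_schedule(1000, 25, 500, 50)` gives `(40, 500, 50)`.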


In [0]:
def train():
    
    vocab = Vocabulary()  # populated by load_glove
    glove_path = cfg["word_vectors"]
    vectors = load_glove(glove_path, vocab)

    #writer = SummaryWriter(log_dir=cfg["save_path"])

    # Map the sentiment labels 0-4 to a more readable form (and back)
    i2t = ["very negative", "negative", "neutral", "positive", "very positive"]
    t2i = OrderedDict((t, i) for i, t in enumerate(i2t))


    print('\n# Constructing model')
    model = Model(
        vocab_size=len(vocab.w2i), 
        emb_size=cfg["embed_size"],
        hidden_size=cfg["hidden_size"], 
        output_size=len(t2i),
        prior_p1=cfg['prior_p1'],
        vocab=vocab, 
        dropout=cfg["dropout"], 
        layer_cls=cfg["layer_cls"],
        layer_inf=cfg["layer_inf"])

    print('\n# Loading embeddings')
    with torch.no_grad():
        model.embed.weight.data.copy_(torch.from_numpy(vectors))
        if cfg["fix_emb"]:
            print("fixed word embeddings")
            model.embed.weight.requires_grad = False
        model.embed.weight[1] = 0.  # padding zero


    optimizer = Adam(model.parameters(), lr=cfg["lr"],
                     weight_decay=cfg["weight_decay"])

    scheduler = ReduceLROnPlateau(
        optimizer, mode="min", factor=cfg["lr_decay"], patience=cfg["patience"],
        verbose=True, cooldown=cfg["cooldown"], threshold=cfg["threshold"],
        min_lr=cfg["min_lr"])

    iter_i = 0
    train_loss = 0.
    print_num = 0
    start = time.time()
    losses = []
    accuracies = []
    best_eval = 1.0e9
    best_iter = 0

    model = model.to(device)

    # print model
    print(model)
    print_parameters(model)

    batch_size = cfg['batch_size']
    eval_batch_size = cfg['eval_batch_size']
    print_every = cfg['print_every']

    kl_inc = cfg['kl_inc']
    kl_weight = cfg['kl_weight']
    min_kl = cfg['min_kl']
    ll_moving_stats = MovingStats(cfg['baseline_memory'])

    while True:  # when we run out of examples, shuffle and continue
        for batch in get_minibatch(train_data, batch_size=batch_size, shuffle=True):

            epoch = iter_i // iters_per_epoch

            # forward pass
            model.train()
            x, targets, _ = prepare_minibatch(batch, model.vocab, device=device)

            # with autograd.detect_anomaly():

            py, q_z, z = model(x)

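            # True for real tokens, False for padding (the padding index is 1)
            mask = (x != 1)
            # "KL annealing": increase the KL weight by kl_inc per iteration, capped at 1
            kl_weight = min(kl_weight + kl_inc, 1.0)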
                
            loss, terms = model.get_loss(
                py, 
                targets, 
                q_z=q_z,
                z=z,
                mask=mask, 
                kl_weight=kl_weight,
                min_kl=min_kl,
                ll_mean=ll_moving_stats.mean(),
                ll_std=ll_moving_stats.std(),
                iter_i=iter_i)

            train_loss += loss.item()
            ll_moving_stats.append(terms['ll'])

            # backward pass
            model.zero_grad()  # erase previous gradients

            loss.backward()  # compute new gradients

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=cfg['max_grad_norm'])

            # update weights
            optimizer.step()

            print_num += 1
            iter_i += 1

            # print info
            if iter_i % print_every == 0:

                train_loss = train_loss / print_every
                #writer.add_scalar('data/train_loss', train_loss, iter_i)
                #for k, v in loss_optional.items():
                #    writer.add_scalar('data/'+k, v, iter_i)

                print_str = make_kv_string(terms)
                print("Epoch %r Iter %r loss=%.4f %s" %
                      (epoch, iter_i, train_loss, print_str))
                losses.append(train_loss)
                print_num = 0
                train_loss = 0.

            # evaluate
            if iter_i % eval_every == 0:

                dev_eval, rationales = evaluate(
                    model, dev_data, 
                    batch_size=eval_batch_size, 
                    device=device,
                    cfg=cfg, iter_i=iter_i)
                accuracies.append(dev_eval["acc"])

                #for k, v in dev_eval.items():
                #    writer.add_scalar('data/dev/'+k, v, iter_i)

                print("\n# epoch %r iter %r: dev %s" % (
                    epoch, iter_i, make_kv_string(dev_eval)))
                
                for exid in range(3):
                    print(' dev%d [gold=%d,pred=%d]:' % (exid, dev_data[exid].label, rationales[exid][1]),  
                          ' '.join(rationales[exid][0]))
                print()

                #test_eval = evaluate(
                #    model, test_data, batch_size=eval_batch_size, device=device,
                #    cfg=cfg, iter_i=iter_i)
                #for k, v in test_eval.items():
                #    writer.add_scalar('data/test/'+k, v, iter_i)

                #print("# epoch %r iter %r: tst %s" % (
                #    epoch, iter_i, make_kv_string(test_eval)))

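                # adjust learning rate based on the dev loss
                scheduler.step(dev_eval["loss"])

            # stop once the configured number of iterations has been reached
            if iter_i >= num_iterations:
                print("# Done training")
                return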
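
The "KL annealing" step in the loop adds `kl_inc` to the KL weight at every iteration and caps it at 1. With the configuration used here (`kl_weight = 1.`) the weight already sits at its cap, so annealing is effectively switched off; starting from 0 enables it. A minimal sketch of the schedule in isolation follows. `anneal` is a hypothetical helper, and the increment of 0.25 is chosen only to keep the example short.

```python
def anneal(kl_weight, kl_inc, num_steps):
    """Return the KL weights used over `num_steps` iterations of linear annealing."""
    weights = []
    for _ in range(num_steps):
        kl_weight = min(kl_weight + kl_inc, 1.0)  # increment, then cap at 1
        weights.append(kl_weight)
    return weights


weights = anneal(kl_weight=0.0, kl_inc=0.25, num_steps=6)
print(weights)  # [0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

With the notebook's `kl_inc = 0.00001`, a schedule that starts from 0 would need roughly 100,000 iterations to reach full weight.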

In [21]:
train()


# Constructing model
Classifier #params: 1505
ProductOfBernoullis #params: 301

# Loading embeddings
fixed word embeddings
Model(
  (embed): Embedding(20727, 300, padding_idx=1)
  (cls_net): Classifier(
    (embed_layer): Sequential(
      (0): Embedding(20727, 300, padding_idx=1)
      (1): Dropout(p=0.5)
    )
    (enc_layer): Passthrough()
    (output_layer): Sequential(
      (0): Dropout(p=0.5)
      (1): Linear(in_features=300, out_features=5, bias=True)
      (2): LogSoftmax()
    )
  )
  (inference_net): ProductOfBernoullis(
    (embed_layer): Sequential(
      (0): Embedding(20727, 300, padding_idx=1)
    )
    (enc_layer): Passthrough()
    (logit_layer): Linear(in_features=300, out_features=1, bias=True)
  )
)
embed.weight             [20727, 300] requires_grad=False
cls_net.output_layer.1.weight [5, 300]     requires_grad=True
cls_net.output_layer.1.bias [5]          requires_grad=True
inference_net.logit_layer.weight [1, 300]     requires_grad=True
inference_net.logit_lay

KeyboardInterrupt: ignored
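
The training loop calls `torch.nn.utils.clip_grad_norm_` with `max_norm=cfg['max_grad_norm']` before each optimiser step. As a rough illustration of what clipping by global norm does, here is a plain-Python sketch on lists of numbers. `clip_by_global_norm` is a hypothetical helper, not PyTorch's implementation; it only mirrors the idea of rescaling all gradients by one common factor when their joint L2 norm exceeds the threshold.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so that their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # one common scale for every vector; small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm


grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm: sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))  # 13.0 5.0
```

Because every vector is scaled by the same factor, the direction of the overall update is preserved and only its length shrinks.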