# Models

## Composing vectors


We use the notation
$$ \mathbf{p} = \text{comp}(\mathbf{u}, \mathbf{v}, K) $$
to represent the composition $\mathbf{p}$ of two objects $\mathbf{u}, \mathbf{v}$, where $K$ represents any external information used in the composition operation (Mitchell & Lapata 2010).

Two simplistic composition operations over vectors which leave $K$ empty are the *additive model*
$$ \text{comp}_+(\mathbf{u}, \mathbf{v}) = \mathbf{u} + \mathbf{v} $$
and the *multiplicative model*
$$ \text{comp}_\odot(\mathbf{u}, \mathbf{v}) = \mathbf{u}\odot\mathbf{v} $$
where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, and $\odot$ is the component-wise (Hadamard) product.

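These models are not suitable for representing composition in natural language semantics, because addition and component-wise multiplication are *commutative*:
$$ \mathbf{a} + \mathbf{b} = \mathbf{b} + \mathbf{a} $$
$$ \mathbf{a} \odot \mathbf{b} = \mathbf{b} \odot \mathbf{a} $$
But semantic composition in natural language is not commutative. The sentences "Suzy drank wine" and "Wine drank Suzy" have different meanings, yet under either model they receive the same representation:
$$ \mathbf{v}_{\text{Suzy}} + \mathbf{v}_{\text{drank}} + \mathbf{v}_{\text{wine}} = \mathbf{v}_{\text{wine}} + \mathbf{v}_{\text{drank}} + \mathbf{v}_{\text{Suzy}} $$
$$ \mathbf{v}_{\text{Suzy}} \odot \mathbf{v}_{\text{drank}} \odot \mathbf{v}_{\text{wine}} = \mathbf{v}_{\text{wine}} \odot \mathbf{v}_{\text{drank}} \odot \mathbf{v}_{\text{Suzy}} $$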

Since they do not take into account word order, the additive and multiplicative models are *bag of words* models.
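
The bag-of-words behaviour can be checked numerically. The following is a minimal sketch, assuming NumPy and made-up 3-dimensional word vectors; the helper names `comp_add`, `comp_mul`, and `compose_sentence` are hypothetical and chosen only for illustration.

```python
import numpy as np

def comp_add(u, v):
    """Additive model: element-wise sum."""
    return u + v

def comp_mul(u, v):
    """Multiplicative model: component-wise (Hadamard) product."""
    return u * v

# Toy word vectors (invented values, for illustration only).
vec = {
    "Suzy":  np.array([1.0, 2.0, 0.5]),
    "drank": np.array([0.5, 1.0, 3.0]),
    "wine":  np.array([2.0, 0.25, 1.0]),
}

def compose_sentence(words, comp):
    """Fold a composition function over the words from left to right."""
    result = vec[words[0]]
    for w in words[1:]:
        result = comp(result, vec[w])
    return result

s1 = ["Suzy", "drank", "wine"]
s2 = ["wine", "drank", "Suzy"]

# Both models assign the two word orders the same vector.
same_add = np.allclose(compose_sentence(s1, comp_add), compose_sentence(s2, comp_add))
same_mul = np.allclose(compose_sentence(s1, comp_mul), compose_sentence(s2, comp_mul))
print(same_add, same_mul)  # True True
```

Any permutation of the three words gives the same result, which is exactly what makes these models insensitive to word order.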

Consider a third model, which uses the Kronecker product (an idea that goes back to Smolensky 1990, 2006, and is investigated further in Grefenstette & Sadrzadeh 2011 and Grefenstette 2013):
$$ \text{comp}_\otimes(\mathbf{u}, \mathbf{v}) = \mathbf{u} \otimes \mathbf{v} $$
For vectors, the Kronecker product has the same entries as the outer product, which it generalizes to matrices:
$$ [\mathbf{u} \otimes \mathbf{v}]_{ij} = u_iv_j $$
so that if $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ then their product can be arranged as a $d \times d$ matrix:
$$ \mathbf{u} \otimes \mathbf{v} = \begin{bmatrix}
    u_1v_1 & \dots & u_1v_d \\
    \vdots & \ddots & \vdots \\
    u_dv_1 & \dots & u_dv_d
\end{bmatrix} $$

The outer product is not commutative; i.e., it is possible that:
$$ [\mathbf{u} \otimes \mathbf{v}]_{ij} = u_iv_j \neq v_iu_j = [\mathbf{v} \otimes \mathbf{u}]_{ij} $$
It can be shown that non-commutativity also holds for Kronecker products.

However, it can also be shown that the Kronecker product is *associative*
$$ (\mathbf{u} \otimes \mathbf{v}) \otimes \mathbf{w} = \mathbf{u} \otimes (\mathbf{v}  \otimes \mathbf{w}) $$
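
These two properties can be illustrated with a small numerical sketch, assuming NumPy and arbitrary 2-dimensional vectors. Note that `np.kron` applied to 1-D arrays returns the entries $u_iv_j$ flattened into a vector of length $d^2$, whereas `np.outer` arranges the same entries as a $d \times d$ matrix.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
w = np.array([5.0, 6.0])

# Outer product: a d x d matrix with entries u_i * v_j.
outer_uv = np.outer(u, v)
outer_vu = np.outer(v, u)
outer_commutes = np.array_equal(outer_uv, outer_vu)

# Kronecker product of 1-D arrays: the same entries, flattened row by row.
kron_uv = np.kron(u, v)
same_entries = np.array_equal(kron_uv, outer_uv.ravel())
kron_commutes = np.array_equal(kron_uv, np.kron(v, u))

# Associativity: the grouping does not matter.
left = np.kron(np.kron(u, v), w)
right = np.kron(u, np.kron(v, w))
kron_associates = np.array_equal(left, right)

print(outer_commutes, kron_commutes)  # False False
print(same_entries, kron_associates)  # True True
```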

Natural language, however, is not associative, because natural language semantics is sensitive to hierarchical structure. For example, "Lisa saw the elephant with the telescope" has two meanings (the elephant has the telescope, or Lisa uses the telescope to see), corresponding to two different parses:
1. (Lisa (saw (the (elephant (with (the telescope))))))
2. (Lisa ((saw (the (elephant))) (with (the (telescope)))))

This model also has the disadvantage that each application of the Kronecker product gives a tensor of higher order than its inputs: composing $n$ vectors in $\mathbb{R}^d$ yields $d^n$ entries, so dimensionality grows exponentially with sentence length.
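
To make the growth concrete, here is a short sketch, assuming NumPy, a toy dimensionality of $d = 4$, and a five-word sentence with random word vectors (all of these values are illustrative assumptions).

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
words = [rng.standard_normal(d) for _ in range(5)]  # five random "word" vectors

rep = words[0]
sizes = [rep.size]
for word_vec in words[1:]:
    rep = np.kron(rep, word_vec)  # each step multiplies the size by d
    sizes.append(rep.size)

print(sizes)  # [4, 16, 64, 256, 1024]
```

With a more realistic $d$ in the hundreds, even a short sentence would produce a representation far too large to store.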

We can distinguish three levels of complexity:
1. **Bag of words.** Neither word-order nor hierarchical structure [additive, multiplicative models].
2. **Sequential.** Word-order but no hierarchical structure [Kronecker model]. (**regular**)
3. **Hierarchical.** Both word-order and hierarchical structure. (**context-free**)

### Section references:
- Mitchell and Lapata (2010), "Vector-based Models of Semantic Composition"
- Grefenstette (2013), chapter 2

## Neural models with explicit compositional structure

These models assume a vector representation for each word in an input sequence, and use recursive neural networks (RNNs) and their variants to learn composition functions on vectors. Depending on the task, the vector representations of words may be pre-trained, trained by the model, or pre-trained and then fine-tuned by the model.

### Recursive Neural Network (RNN)

An input sequence $\mathbf{x}$ of vectors (representing a sequence of words) is parsed into a binary tree structure. Given the vectors $\mathbf{a}, \mathbf{b}$ of two sibling nodes, the vector of their parent node $\mathbf{p}$ is the output of the composition function
$$ \mathbf{p} = \text{comp}_{\text{RNN}}(\mathbf{a}, \mathbf{b}, K). $$

Socher et al. (2010) give this composition function:
$$ \text{comp}_{\text{RNN}}(\mathbf{a}, \mathbf{b}, \mathbf{W}) = f\left(\begin{bmatrix}\mathbf{a}\\\mathbf{b}\end{bmatrix}\mathbf{W}\right) $$
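where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ and $\mathbf{W} \in \mathbb{R}^{2d\times d}$. The stacked vector denotes the concatenation of $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^{2d}$, which multiplies $\mathbf{W}$ as a row vector, so that $\mathbf{p} \in \mathbb{R}^d$. $\mathbf{W}$ is the parameter to learn and is shared across all compositions in the tree. $f$ is the activation function, usually $\tanh$. The model may also include a bias term, not shown above.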
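
As a rough illustration of how this composition function is applied bottom-up over a parse tree, here is a minimal sketch assuming NumPy, random untrained weights, and a tree encoded as nested 2-tuples of words. The names `comp_rnn` and `encode`, the toy embeddings, and the tree encoding are hypothetical and not taken from Socher et al.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((2 * d, d)) * 0.1  # shared composition weights, shape (2d, d)

def comp_rnn(a, b, W, f=np.tanh):
    """Parent vector f([a; b] W), with [a; b] the length-2d concatenation."""
    return f(np.concatenate([a, b]) @ W)

def encode(tree, embeddings, W):
    """Recursively compose a binary tree given as nested 2-tuples of words."""
    if isinstance(tree, str):  # leaf: look up the word vector
        return embeddings[tree]
    left, right = tree
    return comp_rnn(encode(left, embeddings, W), encode(right, embeddings, W), W)

# Random stand-ins for word vectors (untrained, for illustration only).
embeddings = {w: rng.standard_normal(d) for w in ["Suzy", "drank", "wine"]}

root = encode(("Suzy", ("drank", "wine")), embeddings, W)
print(root.shape)  # (4,)
```

Because every node has the same dimensionality $d$, the same `W` can be reused at every level of the tree, unlike the Kronecker model.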

<img src="images/socher2010fig1.png" />

For a classification task, a softmax classifier is applied either to the root node only or to each node $\mathbf{a}$ in the tree (depending on whether the training data has labels for the whole sequence or for every constituent):
$$ \mathbf{y}^{(\mathbf{a})} = \text{softmax}(\mathbf{a}\mathbf{W}_s) $$
where $\mathbf{W}_s \in \mathbb{R}^{d\times m}$ for an $m$-way classification task.
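
A minimal sketch of this classification step, assuming NumPy, a random stand-in for a node vector, and random untrained weights (the sizes $d = 4$ and $m = 3$ are arbitrary):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)  # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

d, m = 4, 3  # node dimensionality and number of classes
rng = np.random.default_rng(0)
W_s = rng.standard_normal((d, m))      # classifier weights, shape (d, m)
a = np.tanh(rng.standard_normal(d))    # stand-in for a node vector

y = softmax(a @ W_s)  # distribution over the m classes
print(y.shape)        # (3,)
```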

**Algebraic properties.** When $f = \tanh$, $\text{comp}_{\text{RNN}}$ is neither commutative nor associative.
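
This claim can be probed numerically. The sketch below assumes NumPy and random values for $\mathbf{W}$ and the input vectors; it is a spot check on one random instance rather than a proof, and for generic random values one would expect both checks to come out `False`.

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W = rng.standard_normal((2 * d, d)) * 0.3  # small scale keeps tanh away from saturation
a, b, c = (rng.standard_normal(d) for _ in range(3))

def comp_rnn(x, y):
    """tanh([x; y] W) with the shared weight matrix W."""
    return np.tanh(np.concatenate([x, y]) @ W)

# Swapping the arguments changes which half of W each vector meets.
commutes = np.allclose(comp_rnn(a, b), comp_rnn(b, a))

# Regrouping changes which vectors pass through the nonlinearity twice.
associates = np.allclose(comp_rnn(comp_rnn(a, b), c), comp_rnn(a, comp_rnn(b, c)))

print(commutes, associates)
```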

### Matrix-Vector RNN (MV-RNN)

The RNN applies the same composition function to every pair of vectors. There is reason to believe that different vectors compose in different ways (e.g., intersective vs. non-intersective adjectives, modification vs. function application).

The MV-RNN extends the RNN so that there can be a different composition function for each pair of children.
Each vector $\mathbf{a}, \mathbf{b}$ is associated with a matrix, $\mathbf{A}, \mathbf{B}$ respectively, so each node is represented by a pair of a vector and a matrix.

The composition function is
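$$ \text{comp}_{\text{MV-RNN}}(\langle\mathbf{a}, \mathbf{A}\rangle, \langle\mathbf{b}, \mathbf{B}\rangle,  \mathbf{W}, \mathbf{W}_M) = 
\left\langle f\left(\begin{bmatrix}\mathbf{Ba}\\\mathbf{Ab}\end{bmatrix}\mathbf{W}\right),
\begin{bmatrix}\mathbf{A} & \mathbf{B}\end{bmatrix}\mathbf{W}_M \right\rangle$$
where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d, \mathbf{A}, \mathbf{B} \in \mathbb{R}^{d \times d}, \mathbf{W} \in \mathbb{R}^{2d\times d}$, and $\mathbf{W}_M \in \mathbb{R}^{2d\times  d}$. The stacked vector denotes the concatenation of $\mathbf{Ba}$ and $\mathbf{Ab}$ in $\mathbb{R}^{2d}$, which multiplies $\mathbf{W}$ as a row vector. In the matrix component, $\mathbf{A}$ and $\mathbf{B}$ are placed side by side to form a $d \times 2d$ matrix, so that the parent matrix is again $d \times d$.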

The parameters to learn are now $\mathbf{W}$, $\mathbf{W}_M$, and the matrix $\mathbf{X}$ paired with each leaf (word) vector $\mathbf{x}$.
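
The following is a minimal sketch of a single MV-RNN composition step, assuming NumPy and random untrained parameters. It reads the matrix component as $\mathbf{A}$ and $\mathbf{B}$ concatenated side by side, which is an assumption made so that the shapes $(d, 2d)$ and $(2d, d)$ are compatible. The helper `leaf` and its near-identity initialisation are likewise hypothetical choices for illustration.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
W = rng.standard_normal((2 * d, d)) * 0.1    # vector composition weights
W_M = rng.standard_normal((2 * d, d)) * 0.1  # matrix composition weights

def comp_mv_rnn(node_a, node_b, W, W_M, f=np.tanh):
    """Compose two (vector, matrix) nodes into a parent (vector, matrix) node."""
    a, A = node_a
    b, B = node_b
    # Each child's vector is first transformed by its sibling's matrix.
    p = f(np.concatenate([B @ a, A @ b]) @ W)
    # Parent matrix: (d, 2d) @ (2d, d) -> (d, d).
    P = np.hstack([A, B]) @ W_M
    return p, P

def leaf():
    """Hypothetical leaf: random vector paired with a near-identity matrix."""
    return rng.standard_normal(d), np.eye(d) + 0.01 * rng.standard_normal((d, d))

p, P = comp_mv_rnn(leaf(), leaf(), W, W_M)
print(p.shape, P.shape)  # (3,) (3, 3)
```

Since the parent is again a (vector, matrix) pair of the same shapes, the step can be applied recursively up the tree in the same way as the plain RNN.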

<img src="images/socher2012fig2.png" />

**Algebraic properties.** The MV-RNN generalizes the RNN above, and it is likewise neither commutative nor associative.

### Other models

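- RNTN (Recursive Neural Tensor Network) (Socher et al. 2013): similar in motivation to MV-RNNs, but uses a single tensor-based composition function shared across all nodes, so the number of parameters to learn is fixed and does not grow with the vocabulary.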
- FCN (Forest Convolutional Network) (Le & Zuidema 2015): extends RNN to sets of trees (forests) to capture ambiguity, and performs pooling over multiple parses of a constituent to obtain composed vectors.
- Tree LSTM (Tai, Socher, & Manning 2015): inspired by LSTMs and RNNs. Uses an LSTM-like architecture over trees, rather than sequences. This amounts to stating a much more complex composition function over the trees.