# Character-Aware Neural Language Models

A summary and demonstration by Nicholas Farn. Contact: <nfarn@g.ucla.edu>

This notebook describes a "Character-Aware Neural Language Model". At its simplest, the model combines a convolutional neural network (CNN), a highway network, and a long short-term memory (LSTM) recurrent neural network. The CNN takes the characters of a given word as input; its output passes through a highway network and is then fed into the LSTM, which produces a word-level prediction. The model is trained on the Penn Treebank, a sample of which is imported below.

In [None]:
import tensorflow as tf
import numpy as np

from itertools import islice

# create init functions
weight_init = lambda shape : tf.Variable(tf.truncated_normal(shape, stddev=0.1))
bias_init = lambda shape : tf.Variable(tf.constant(0.1, shape=shape))
conv_init = lambda x, W : tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='VALID')

# import training, testing, and validation data sets
# characters are encoded with most frequent as 1
batch_size = 100

![alt text](character-model.png "Architecture")
<p style='text-align: center;'><b>Figure 1:</b> Example Architecture of Character-Aware Neural Network</p>

## Character-level Convolutional Neural Network

As <b>Figure 1</b> shows, the base layer is a convolutional neural network. The CNN takes a matrix $\mathbf{C}^k \in \mathbb{R}^{d \times l}$, where $l$ is the character length of a word, $d$ is the dimension of the chosen character embedding, and $k$ denotes the word position. The character embeddings are the columns of a matrix $\mathbf{Q} \in \mathbb{R}^{d \times \vert \mathcal{C} \vert}$, where $\mathcal{C}$ is the vocabulary of characters. Furthermore, "start" and "end" characters are added to each word, as well as padding to create uniform sizes for batch training.

In [None]:
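# assumes `words` (a list of word strings) and `data` (an array of character ids,
# including the padding id) were produced when loading the data sets
padded_size = max(len(word) for word in words)
char_size = len(np.unique(data))
input_dim = 15

# holds padded character ids, one row per word: [batch, padded_size]
char_ids = tf.placeholder(tf.int32, [None, padded_size])

# create character embedding
char_embeddings = tf.get_variable("character_embeddings", [char_size, input_dim])
embedded_char_ids = tf.nn.embedding_lookup(char_embeddings, char_ids)

# rearrange to [batch, input_dim, padded_size, 1] to match the conv kernels
embedded_char_ids = tf.expand_dims(tf.transpose(embedded_char_ids, [0, 2, 1]), -1)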

# add padding
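
The start and end markers, the padding, and the lookup in $\mathbf{Q}$ described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the toy character set, the `{` and `}` marker characters, the reserved padding id 0, and the helper `encode_word` are all hypothetical and are not taken from the notebook's data pipeline.

```python
import numpy as np

# Hypothetical toy character vocabulary. Id 0 is reserved for padding;
# '{' and '}' stand in for the start- and end-of-word characters.
chars = ['{', '}'] + list("abcdefghijklmnopqrstuvwxyz")
char_to_id = {c: i + 1 for i, c in enumerate(chars)}

def encode_word(word, padded_size):
    """Wrap a word in start/end markers and right-pad with zeros."""
    ids = [char_to_id[c] for c in '{' + word + '}']
    return ids + [0] * (padded_size - len(ids))

words = ["the", "quick", "fox"]
padded_size = max(len(w) for w in words) + 2     # +2 for the two markers
ids = np.array([encode_word(w, padded_size) for w in words])

d = 15                                           # embedding dimension
rng = np.random.default_rng(0)
Q = rng.normal(size=(d, len(chars) + 1))         # one column per character id

# C[k] is the d x l matrix C^k for the k-th word.
C = Q[:, ids].transpose(1, 0, 2)                 # shape (num_words, d, padded_size)
```

Each column of `C[k]` is the embedding of one character, so padded positions all share the column `Q[:, 0]`.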


Filters of varying width, $\mathbf{H}_i \in \mathbb{R}^{d \times w_i}$, are then slid along the columns of $\mathbf{C}^k$, producing feature maps of varying length. The maximum over time is taken from each feature map to create a uniformly sized output. With filters of varying width, the idea is to capture the most prominent n-grams within a word. Given a series of filters, $\mathbf{H}_1, \dots, \mathbf{H}_n$, and their widths, $w_i$, the output, $\mathbf{y}^k$, can be calculated as follows:

\begin{align}
\mathbf{y}^k &= [y_1^k, \dots, y_n^k] \\
y_i^k &= \max_j \mathbf{f}_i^k [j] \\
\mathbf{f}_i^k [j] &= \tanh(\langle \mathbf{C}^k[:, j:j + w_i - 1], \mathbf{H}_i \rangle + b_i) \\
\langle \mathbf{A}, \mathbf{B} \rangle &= \text{Tr}(\mathbf{AB}^T)
\end{align}

and is implemented below using Python 3 and TensorFlow:

In [None]:
# set input and filter dimensions
filter_widths = np.arange(1,7)
cnn_input = embedded_char_ids

# set filters and biases
cnn_kernels = [
    weight_init([input_dim, width, 1, 25*width]) for width in filter_widths
]
cnn_biases = [
    bias_init([25*width]) for width in filter_widths
]

# achieves same effect as trace
cnn_convs = [
    tf.tanh(conv_init(cnn_input, kernel) + bias)
    for kernel, bias in zip(cnn_kernels, cnn_biases)
]

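# take the max over time for each filter, then concatenate along the feature axis
# (assumes cnn_input has shape [batch, input_dim, padded_size, 1])
cnn_output = tf.concat([tf.reduce_max(conv, axis=[1, 2]) for conv in cnn_convs], 1)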
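
To make the max-over-time equations concrete, here is a small NumPy sketch for a single word and a single filter. The function name `char_cnn_feature` and the random inputs are illustrative assumptions. The sketch also checks that the trace inner product equals an element-wise multiply-and-sum, which is what a `'VALID'` convolution computes at each position.

```python
import numpy as np

def char_cnn_feature(C, H, b):
    """Return max_j tanh(<C[:, j:j+w], H> + b) for one filter H of width w."""
    d, l = C.shape
    w = H.shape[1]
    # Python slices exclude the end index, so C[:, j:j + w] holds w columns.
    f = np.array([
        np.tanh(np.trace(C[:, j:j + w] @ H.T) + b)
        for j in range(l - w + 1)
    ])
    return f.max()

rng = np.random.default_rng(0)
C = rng.normal(size=(15, 7))      # d = 15, word length l = 7
H = rng.normal(size=(15, 3))      # one filter of width w = 3
y = char_cnn_feature(C, H, b=0.1)

# Tr(A B^T) is the sum of the element-wise products of A and B.
window = C[:, 2:5]
trace_matches_sum = np.isclose(np.trace(window @ H.T), np.sum(window * H))
```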

## Highway Network

The output of the CNN could be fed directly into the LSTM; instead, it is first run through a highway network. A highway network introduces an adaptive gate that carries some of the input through unchanged while transforming the rest. The highway network is fully described below, where $\circ$ represents element-wise multiplication and $\mathbf{W}_H$ and $\mathbf{W}_T$ are square matrices so that $\mathbf{z}$ has the same dimension as $\mathbf{y}$. Here $\mathbf{t}$ is called the transform gate and $1 - \mathbf{t}$ the carry gate.

\begin{align}
\mathbf{z} &= \mathbf{t} \circ g(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H) + (1 - \mathbf{t}) \circ \mathbf{y} \\
\mathbf{t} &= \sigma(\mathbf{W}_T \mathbf{y} + \mathbf{b}_T)\\
g(x) &= \max(0, x) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{align}

A highway network is noted to improve results compared to feeding the CNN output directly into the LSTM. If the CNN can be seen as extracting the most significant character n-grams in a word, a highway network can be seen as tossing out certain n-grams that are useless in the context of others. The highway network is implemented below. Direct CNN input and highway input will be compared later.

In [None]:
hwy_input = cnn_output
N = hwy_input.get_shape().as_list()[-1] # static feature dimension

weight_T = weight_init([N, N])
weight_H = weight_init([N, N])
bias_T = bias_init([N])
bias_H = bias_init([N])

# inputs are row vectors of shape [batch, N], so multiply on the right
trans_gate = tf.sigmoid(tf.matmul(hwy_input, weight_T) + bias_T)
# the bias belongs inside the ReLU, matching g(W_H y + b_H)
trans_output = tf.multiply(trans_gate, tf.nn.relu(tf.matmul(hwy_input, weight_H) + bias_H))
carry_output = tf.multiply(1 - trans_gate, hwy_input)
hwy_output = trans_output + carry_output
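
The gating behaviour of the highway equations is easy to verify in NumPy on a single vector. The sketch below is illustrative only: the function name `highway` and the extreme bias values of -20 and +20 are assumptions chosen to force the transform gate nearly closed or nearly open.

```python
import numpy as np

def highway(y, W_H, b_H, W_T, b_T):
    """z = t * relu(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    t = 1.0 / (1.0 + np.exp(-(W_T @ y + b_T)))
    h = np.maximum(0.0, W_H @ y + b_H)
    return t * h + (1.0 - t) * y

rng = np.random.default_rng(0)
n = 6
y = rng.normal(size=n)
W_H = rng.normal(size=(n, n))
b_H = rng.normal(size=n)
W_T = np.zeros((n, n))

# Strongly negative gate bias: t is close to 0, so the input is carried through.
z_carry = highway(y, W_H, b_H, W_T, np.full(n, -20.0))

# Strongly positive gate bias: t is close to 1, so the output is the transform.
z_transform = highway(y, W_H, b_H, W_T, np.full(n, 20.0))
```

In the first case `z_carry` is approximately `y`; in the second `z_transform` is approximately `relu(W_H @ y + b_H)`.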

## Recurrent Neural Network

The recurrent network is an LSTM, described by the following equations, where $\circ$ is element-wise multiplication and $\sigma$ is the sigmoid function. Additionally, $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the <i>input</i>, <i>forget</i>, and <i>output</i> gates respectively at time-step $t$. $\mathbf{h}_t$ and $\mathbf{c}_t$ are the hidden and cell vectors and are zero vectors when $t = 0$.

\begin{align}
\mathbf{i}_t &= \sigma(\mathbf{W}^i \mathbf{x}_t + \mathbf{U}^i \mathbf{h}_{t-1} + \mathbf{b}^i) \\
\mathbf{f}_t &= \sigma(\mathbf{W}^f \mathbf{x}_t + \mathbf{U}^f \mathbf{h}_{t-1} + \mathbf{b}^f) \\
\mathbf{o}_t &= \sigma(\mathbf{W}^o \mathbf{x}_t + \mathbf{U}^o \mathbf{h}_{t-1} + \mathbf{b}^o) \\
\mathbf{g}_t &= \tanh(\mathbf{W}^g \mathbf{x}_t + \mathbf{U}^g \mathbf{h}_{t-1} + \mathbf{b}^g) \\
\mathbf{c}_t &= \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \mathbf{g}_t \\
\mathbf{h}_t &= \mathbf{o}_t \circ \tanh(\mathbf{c}_t)
\end{align}
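
As a rough illustration of these recurrences, a single LSTM step can be written directly in NumPy. The dictionary layout for the weights, the dimensions, and the random initialisation are assumptions made for this sketch only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts keyed by 'i', 'f', 'o', 'g'."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, m = 8, 4                                 # input and hidden sizes
W = {k: rng.normal(scale=0.1, size=(m, n_in)) for k in 'ifog'}
U = {k: rng.normal(scale=0.1, size=(m, m)) for k in 'ifog'}
b = {k: np.zeros(m) for k in 'ifog'}

# Hidden and cell vectors start as zero vectors at t = 0.
h, c = np.zeros(m), np.zeros(m)
for x in rng.normal(size=(5, n_in)):           # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, U, b)
```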

The output at time $t$ is obtained by applying an affine transformation to the hidden state $\mathbf{h}_t$ and then taking the softmax. This yields a probability distribution over all possible words.

$$\Pr(w_{t+1} = j | w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}^j + q^j)}{\sum_{j' \in \mathcal{V}} \exp(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'})}$$

Here $\mathbf{p}^j$ is the $j$-th column of the output embedding matrix $\mathbf{P} \in \mathbb{R}^{m \times \vert \mathcal{V} \vert}$, $q^j$ is a bias term, and $\mathcal{V}$ is simply our vocabulary of words. Two LSTMs are implemented below: one taking input directly from the CNN and one taking input from the highway network.

In [None]:
lstm_input = hwy_output
N = lstm_input.get_shape().as_list()[-1] # static feature dimension
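
The softmax output layer described above can be sketched in NumPy as follows. The function name `next_word_distribution`, the toy sizes, and the random parameters are assumptions for illustration; subtracting the maximum logit is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import numpy as np

def next_word_distribution(h, P, q):
    """Softmax over h . p^j + q^j for every word j in the vocabulary."""
    logits = h @ P + q
    logits = logits - logits.max()     # stabilise the exponentials
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
m, vocab_size = 4, 10                  # hidden size and |V| (toy values)
P = rng.normal(size=(m, vocab_size))   # output embedding matrix
q = rng.normal(size=vocab_size)        # bias terms
h = rng.normal(size=m)                 # hidden state h_t

probs = next_word_distribution(h, P, q)
predicted_word = int(np.argmax(probs))
```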



## Training

The models are trained by truncated backpropagation through time, with gradients propagated back up to 35 time steps. The models are evaluated by perplexity, the exponential of the average negative log-likelihood of a sequence of words. These are defined below.

\begin{align}
NLL &= -\sum_{t=1}^T \log \Pr(w_t | w_{1:t-1}) \\
PPL &= \exp\left(\frac{NLL}{T}\right)
\end{align}
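
As a quick sanity check of these definitions, the sketch below computes perplexity from the probabilities a model assigned to each observed word. The function name `perplexity` and the example probabilities are hypothetical.

```python
import numpy as np

def perplexity(word_probs):
    """exp(NLL / T), where word_probs[t] = Pr(w_t | w_{1:t-1})."""
    word_probs = np.asarray(word_probs, dtype=float)
    nll = -np.sum(np.log(word_probs))
    return np.exp(nll / len(word_probs))

# A model that always assigns probability 1/4 to the observed word
# has perplexity 4, regardless of sequence length.
uniform_ppl = perplexity([0.25] * 8)

# More confident correct predictions lower the perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])
```

Perplexity can be read as the effective number of equally likely choices the model faces at each step.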

The learning rate is initially set to 1.0 and is halved whenever the perplexity does not decrease by more than 1.0 over a training epoch. In addition, the model is regularized using dropout, and the gradient is rescaled so that its $L_2$ norm is at most 5.
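
The learning-rate schedule and gradient rescaling can be sketched as two small helpers. Both function names are hypothetical, and two details are assumptions rather than statements about the notebook: the norm is taken jointly over all gradient arrays, and "decrease by 1.0" is read as a strict "more than 1.0".

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads], total

def update_learning_rate(lr, prev_ppl, ppl, min_drop=1.0):
    """Halve lr unless perplexity fell by more than min_drop over the epoch."""
    return lr if prev_ppl - ppl > min_drop else lr / 2.0

grads = [np.full(3, 10.0), np.full(4, 10.0)]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

lr_kept = update_learning_rate(1.0, prev_ppl=100.0, ppl=98.0)    # fell by 2.0
lr_halved = update_learning_rate(1.0, prev_ppl=100.0, ppl=99.5)  # fell by 0.5
```

Gradients whose norm is already below the threshold pass through unchanged, since the scale factor is capped at 1.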

## References