# Character-Aware Neural Language Models

A summary and demonstration by Nicholas Farn. Contact: <nfarn@g.ucla.edu>

This notebook describes a "Character-Aware Neural Language Model". At its simplest level, it is an amalgamation of a convolutional neural network (CNN), a highway network, and a long short-term memory (LSTM) recurrent neural network. The CNN takes the characters of a given word as input; its output passes through a highway network, whose output is in turn fed into the LSTM. The LSTM then produces a word-level prediction. The model is trained on the Penn Treebank, a sample of which is imported below. A representation of the model's architecture can be viewed in <b>Figure 1</b>.

In [None]:
import tensorflow as tf
import numpy as np
import json

# import character and word representations
batch_size = 100
with open('data/chars', 'r') as f:
    char_table = json.load(f)
with open('data/words', 'r') as f:
    word_table = json.load(f)

# import training, testing, and validation sets

![alt text](character-model.png "Architecture")
<p style='text-align: center;'><b>Figure 1:</b> Example Architecture of Character-Aware Neural Network</p>

## Character-level Convolutional Neural Network

As we can see from <b>Figure 1</b>, the base layer is a convolutional neural network. The CNN takes the characters in a word as input. This can be represented as a vector $\mathbf{w}_k$, the $k$-th word in a sequence, with entries $c_{kj}$, the id of the $j$-th character in word $k$. Since words vary in length, each word is padded to a uniform length equal to the length of the longest word. Each word also has a start character prepended and an end character appended before padding, which aids the accuracy of the model. Additionally, each sequence is padded to a length of 35, since the LSTM is trained using truncated backpropagation up to 35 time-steps. This is discussed in more detail later. Both kinds of padding make batch training easier.
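
The padding scheme described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the special ids `PAD`, `START`, and `END`, the toy `char_to_id` mapping, and the helper names are hypothetical and not taken from the notebook's `char_table`, and the sketch assumes the start and end markers add two positions to the padded word length.

```python
import numpy as np

# Hypothetical special ids; the real char_table may use different values.
PAD, START, END = 0, 1, 2

def encode_word(word, char_to_id, max_word_len):
    """Wrap a word in start/end markers, then right-pad with PAD."""
    ids = [START] + [char_to_id[c] for c in word] + [END]
    padded = np.full(max_word_len + 2, PAD, dtype=np.int32)
    padded[:len(ids)] = ids
    return padded

def encode_sequence(words, char_to_id, max_word_len, seq_len):
    """Encode up to seq_len words; rows past the last word stay all-PAD."""
    out = np.full((seq_len, max_word_len + 2), PAD, dtype=np.int32)
    for k, word in enumerate(words[:seq_len]):
        out[k] = encode_word(word, char_to_id, max_word_len)
    return out

# Toy vocabulary: letters get ids 3, 4, ... in sorted order.
words = ["the", "cat", "sat"]
char_to_id = {c: i + 3 for i, c in enumerate(sorted(set("".join(words))))}
max_word_len = max(len(w) for w in words)

batch_row = encode_sequence(words, char_to_id, max_word_len, seq_len=35)
print(batch_row.shape)  # (35, 5)
print(batch_row[0])     # [1 8 6 5 2]
```

Stacking `batch_size` such arrays would give an integer tensor with the same layout as a `[batch_size, seq_len, word_length]` character input.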

In [None]:
max_word_len = len(max(word_table.keys(), key=len))
seq_len = 35

char_input = tf.placeholder(tf.int32, [batch_size, seq_len, max_word_len])

Each character is then embedded through the use of a matrix $\mathbf{Q} \in \mathbb{R}^{d \times \vert \mathcal{C} \vert}$, where $\mathcal{C}$ is the vocabulary of characters and $d$ is the dimension of the embeddings, in this case 15. Thus word $k$ is converted into a matrix $\mathbf{C}^k \in \mathbb{R}^{d \times l}$, where $l$ is the length of the longest word.
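
As a rough illustration of the lookup, the sketch below builds $\mathbf{C}^k$ by selecting columns of a random $\mathbf{Q}$. The vocabulary size and the character ids are toy values chosen for the example, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 15               # embedding dimension from the text
char_vocab_size = 9  # toy value, not the real |C|
Q = rng.normal(size=(d, char_vocab_size))  # one column per character

word_ids = np.array([1, 8, 6, 5, 2])  # toy padded word of length l = 5
C_k = Q[:, word_ids]                  # column j embeds the j-th character
print(C_k.shape)                      # (15, 5)
```

Storing the embeddings as the transpose, with shape `[char_vocab_size, embed_dim]`, is equivalent: the lookup then selects rows instead of columns.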

In [None]:
char_vocab_size = len(char_table)
embed_dim = 15

char_embeddings = tf.get_variable("char_embed", [char_vocab_size, embed_dim])

Kernels of varying width are applied along the word length, where a kernel of width $w_i$ is a matrix $\mathbf{H}_i \in \mathbb{R}^{d \times w_i}$. The output of the convolution with kernel $\mathbf{H}_i$ is passed through a tanh activation and then max-pooled along the word length, which keeps only the strongest response of each filter. The resulting values are then concatenated into a single vector, so every word yields an output vector of the same size regardless of its length.

The idea behind varying kernel widths is to capture the most significant n-grams for a given word. Thus the CNN could potentially learn that the trigram "foo" is important in the word <b>foo</b>bar. The kernel widths range from 1 to 7, and the number of filters for each width is 50 times the width, capped at 200. The specific equations are defined and implemented below.

\begin{align}
\mathbf{y}^k &= [y_1^k, \dots, y_n^k] \\
y_i^k &= \max_j \mathbf{f}_i^k [j] \\
\mathbf{f}_i^k [j] &= \tanh(\langle \mathbf{C}^k[:, j:j + w_i - 1], \mathbf{H}_i \rangle + b_i) \\
\langle \mathbf{A}, \mathbf{B} \rangle &= \text{Tr}(\mathbf{AB}^T)
\end{align}
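
The equations above can be checked numerically with a small NumPy sketch that applies the windowed inner product, the tanh, and the max over positions one kernel at a time. The helper names, the toy dimensions, and the random inputs are assumptions made for illustration. The last lines compute the total number of filters implied by the widths and filter counts given in the text.

```python
import numpy as np

def frobenius(A, B):
    """<A, B> = Tr(A B^T), i.e. the sum of element-wise products."""
    return np.trace(A @ B.T)

def conv_feature(C, H, b):
    """Slide kernel H (d x w) over C (d x l) and return the max response."""
    l = C.shape[1]
    w = H.shape[1]
    # Python slices exclude the end index, so j:j+w covers columns j..j+w-1.
    f = np.array([np.tanh(frobenius(C[:, j:j + w], H) + b)
                  for j in range(l - w + 1)])
    return f.max()

def char_cnn(C, kernels, biases):
    """Collect one pooled feature per kernel into the word vector y."""
    return np.array([conv_feature(C, H, b) for H, b in zip(kernels, biases)])

rng = np.random.default_rng(0)
d, l = 15, 9                 # embedding size from the text, toy word length
C = rng.normal(size=(d, l))
widths = [1, 2, 3]           # toy subset of the widths 1..7
kernels = [rng.normal(scale=0.1, size=(d, w)) for w in widths for _ in range(2)]
biases = [0.1] * len(kernels)

y = char_cnn(C, kernels, biases)
print(y.shape)  # (6,)

# Filters per width as described in the text: 50 * width, capped at 200.
filters_per_width = [min(200, 50 * w) for w in range(1, 8)]
print(sum(filters_per_width))  # 1100
```

The explicit loop is slow but mirrors the math term by term; a batched convolution computes the same quantities for all positions and filters at once.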

In [None]:
# create init functions
weight_init = lambda shape : tf.Variable(tf.truncated_normal(shape, stddev=0.1))
bias_init = lambda shape : tf.Variable(tf.constant(0.1, shape=shape))
conv_init = lambda x, W : tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='VALID')

# set input and filter dimensions
kernel_widths = np.arange(1,8)

# set filters and biases
cnn_kernels = [
    weight_init([1, width, embed_dim, min(200, 50*width)]) for width in kernel_widths
]
cnn_biases = [
    bias_init([min(200, 50*width)]) for width in kernel_widths
]

# combine max output into one tensor, reshape into array
cnn_outputs = list()
char_indices = tf.split(char_input, seq_len, 1)
for i in range(seq_len):
    # get individual word, embed characters
    char_embed = tf.nn.embedding_lookup(char_embeddings, char_indices[i])

    # create convolutions, combine results to uniformly sized vector
    layers = list()
    for width, kernel, bias in zip(kernel_widths, cnn_kernels, cnn_biases):
        conv = tf.tanh(conv_init(char_embed, kernel) + bias)
        pool = tf.nn.max_pool(conv, [1, 1, max_word_len - width + 1, 1], [1, 1, 1, 1], 'VALID')
        # squeeze only the two singleton spatial axes so the batch axis is always kept
        layers.append(tf.squeeze(pool, axis=[1, 2]))

    cnn_outputs.append(tf.concat(layers, 1))

## Highway Network

The output of the CNN could be fed directly into the LSTM; instead, it is first run through a highway network. A highway network introduces a gate that adaptively carries some dimensions of the input through unchanged while transforming others. The highway network is completely described below, where $\circ$ represents element-wise multiplication and $\mathbf{W}_H$ and $\mathbf{W}_T$ are square matrices, so that $\mathbf{z}$ has the same dimension as $\mathbf{y}$. Furthermore, $\mathbf{t}$ is called the transform gate and $1 - \mathbf{t}$ the carry gate.

\begin{align}
\mathbf{z} &= \mathbf{t} \circ g(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H) + (1 - \mathbf{t}) \circ \mathbf{y} \\
\mathbf{t} &= \sigma(\mathbf{W}_T \mathbf{y} + \mathbf{b}_T)\\
g(x) &= \max(0, x) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{align}
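
A minimal NumPy sketch of these equations follows. The dimension, the random weights, and the extreme bias values are assumptions picked to show the two limiting cases of the gate, not settings from the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_H, b_H, W_T, b_T):
    """z = t * relu(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    t = sigmoid(W_T @ y + b_T)
    h = np.maximum(0.0, W_H @ y + b_H)
    return t * h + (1.0 - t) * y

rng = np.random.default_rng(0)
n = 8  # toy dimension
y = rng.normal(size=n)
W_H = rng.normal(scale=0.1, size=(n, n))
W_T = rng.normal(scale=0.1, size=(n, n))
b_H = np.zeros(n)

# A strongly negative transform bias closes the gate: the input is carried.
z_carry = highway(y, W_H, b_H, W_T, np.full(n, -20.0))
# A strongly positive transform bias opens the gate: the input is transformed.
z_transform = highway(y, W_H, b_H, W_T, np.full(n, 20.0))

print(np.allclose(z_carry, y, atol=1e-6))
print(np.allclose(z_transform, np.maximum(0.0, W_H @ y + b_H), atol=1e-6))
```

Between these extremes, each output dimension is a convex combination of the transformed and the original input, decided per dimension by the gate.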

A highway network is noted to improve results compared with feeding the CNN output directly into the LSTM. If the CNN can be seen as extracting the most significant character n-grams in a word, a highway network can be seen as discarding n-grams that are uninformative in the context of the others. The highway network is implemented below. Direct CNN input and highway input will be compared later.

In [None]:
hwy_inputs = cnn_outputs
N = sum([min(200, 50*width) for width in kernel_widths])

# initialize highway weights and biases
weight_T = weight_init([N, N])
weight_H = weight_init([N, N])
bias_T = bias_init([N])
bias_H = bias_init([N])

# compute new output
hwy_outputs = list()
for hwy_input in hwy_inputs:
    trans_gate = tf.sigmoid(tf.matmul(hwy_input, weight_T) + bias_T)
    # the bias belongs inside the ReLU, matching g(W_H y + b_H)
    trans_output = tf.multiply(trans_gate, tf.nn.relu(tf.matmul(hwy_input, weight_H) + bias_H))
    carry_output = tf.multiply(1 - trans_gate, hwy_input)
    hwy_outputs.append(trans_output + carry_output)

## Recurrent Neural Network

The recurrent neural network is simply a two-layer LSTM. The model is described by the following equations, where $\sigma$ is the sigmoid function. Additionally, $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the <i>input</i>, <i>forget</i>, and <i>output</i> gates respectively at time-step $t$. $\mathbf{h}_t$ and $\mathbf{c}_t$ are the hidden and cell vectors, both of which are zero vectors at $t = 0$. The memory cell is chosen to have a dimension of 600.

\begin{align}
\mathbf{i}_t &= \sigma(\mathbf{W}^i \mathbf{x}_t + \mathbf{U}^i \mathbf{h}_{t-1} + \mathbf{b}^i) \\
\mathbf{f}_t &= \sigma(\mathbf{W}^f \mathbf{x}_t + \mathbf{U}^f \mathbf{h}_{t-1} + \mathbf{b}^f) \\
\mathbf{o}_t &= \sigma(\mathbf{W}^o \mathbf{x}_t + \mathbf{U}^o \mathbf{h}_{t-1} + \mathbf{b}^o) \\
\mathbf{g}_t &= \tanh(\mathbf{W}^g \mathbf{x}_t + \mathbf{U}^g \mathbf{h}_{t-1} + \mathbf{b}^g) \\
\mathbf{c}_t &= \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \mathbf{g}_t \\
\mathbf{h}_t &= \mathbf{o}_t \circ \tanh(\mathbf{c}_t)
\end{align}
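
To make the recurrence concrete, here is a single-layer LSTM step written directly from the gate definitions in NumPy. It is a sketch with toy sizes and random weights, not the two-layer, 600-unit model; the `params` dictionary layout is a convention invented for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time-step; params maps 'i', 'f', 'o', 'g' to (W, U, b) tuples."""
    pre = {k: W @ x_t + U @ h_prev + b for k, (W, U, b) in params.items()}
    i_t = sigmoid(pre['i'])          # input gate
    f_t = sigmoid(pre['f'])          # forget gate
    o_t = sigmoid(pre['o'])          # output gate
    g_t = np.tanh(pre['g'])          # candidate cell update
    c_t = f_t * c_prev + i_t * g_t   # new cell vector
    h_t = o_t * np.tanh(c_t)         # new hidden vector
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, m = 6, 4  # toy input and memory sizes
params = {k: (rng.normal(scale=0.1, size=(m, n_in)),
              rng.normal(scale=0.1, size=(m, m)),
              np.zeros(m)) for k in 'ifog'}

# Hidden and cell vectors start at zero, then run over 35 time-steps.
h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(35, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

A second layer would apply the same step with its own parameters, taking the first layer's $\mathbf{h}_t$ as its input $\mathbf{x}_t$.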

In [None]:
# implement both lstm cells, direct and highway input
lstm_inputs = hwy_outputs
M = 600

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(M)

lstm_outputs = list()

The output at time $t$ is obtained by applying a softmax to an affine transformation of the hidden output at time $t$, $\mathbf{h}_t$. This creates a probability distribution over all possible words.

$$\Pr(w_{t+1} = j | w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}^j + q^j)}{\sum_{j' \in \mathcal{V}} \exp(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'})}$$

Here $\mathbf{p}^j$ is the $j$-th column of $\mathbf{P} \in \mathbb{R}^{m \times \vert \mathcal{V} \vert}$, an output embedding matrix, $q^j$ is a bias term, and $\mathcal{V}$ is simply our vocabulary of words. Two LSTMs, one taking direct input from the CNN and one taking input from the highway network, are both implemented below.
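
The softmax above can be sketched in a few lines of NumPy. The sizes and random values are toy assumptions; subtracting the maximum logit is a standard numerical-stability step that leaves the resulting probabilities unchanged.

```python
import numpy as np

def next_word_distribution(h_t, P, q):
    """Softmax over h_t . p^j + q^j for every word j in the vocabulary."""
    logits = h_t @ P + q
    logits = logits - logits.max()  # stability only; cancels in the ratio
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
m, vocab_size = 4, 10  # toy hidden size and vocabulary size
h_t = rng.normal(size=m)
P = rng.normal(size=(m, vocab_size))
q = np.zeros(vocab_size)

probs = next_word_distribution(h_t, P, q)
print(probs.shape)  # (10,)
```

The entries of `probs` are positive and sum to one, so `probs[j]` can be read as the predicted probability that the next word is word $j$.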

In [None]:
# placeholder for true outputs
# embedding to create prediction

## Training

The models are trained through truncated backpropagation up to 35 time steps. The models are rated by perplexity, the exponential of the average negative log-likelihood of a sequence of words. These are defined below.

\begin{align}
NLL &= -\sum_{t=1}^T \log \Pr(w_t | w_{1:t-1}) \\
PPL &= \exp\left(\frac{NLL}{T}\right)
\end{align}
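
As a quick sanity check of these definitions, the sketch below computes perplexity from the probabilities a model assigned to the words that actually occurred. The function name and the two toy inputs are illustrative assumptions.

```python
import numpy as np

def perplexity(target_probs):
    """exp(NLL / T), where target_probs[t] is the probability of the true word t."""
    target_probs = np.asarray(target_probs, dtype=float)
    nll = -np.sum(np.log(target_probs))
    return np.exp(nll / target_probs.size)

uniform = perplexity([0.1] * 20)  # every true word gets probability 1/10
perfect = perplexity([1.0] * 20)  # every true word gets probability 1
print(np.isclose(uniform, 10.0), np.isclose(perfect, 1.0))  # True True
```

A model that always spreads its probability evenly over ten words has a perplexity of 10, which is why perplexity is often read as an effective number of equally likely choices per word.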

The learning rate is initially set to 1.0 and is halved whenever the perplexity does not decrease by at least 1.0 over a training epoch. In addition, the model is regularized using dropout, and the gradient is renormalized so that its $L_2$ norm is less than or equal to 5.
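
The two rules can be sketched as small helper functions. This is one possible reading under stated assumptions: the norm is taken jointly over all gradient arrays rather than per array, the text does not say on which data set the perplexity is measured, and both function names are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def update_learning_rate(lr, prev_ppl, curr_ppl, min_improvement=1.0):
    """Halve lr when perplexity fell by less than min_improvement this epoch."""
    if prev_ppl - curr_ppl < min_improvement:
        return lr / 2.0
    return lr

# Toy gradients with joint norm sqrt(4 * 9 + 3 * 16) = sqrt(84), above 5.
grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(np.isclose(norm_after, 5.0))  # True

print(update_learning_rate(1.0, prev_ppl=100.0, curr_ppl=99.5))  # 0.5
print(update_learning_rate(1.0, prev_ppl=100.0, curr_ppl=98.0))  # 1.0
```

Gradients whose joint norm is already below the threshold pass through unchanged, so the clipping only affects unusually large updates.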

In [1]:
# implement training and loss function
# implement checks for adjusting training rate and gradient normalization

## Conclusion