# Introduction

In this notebook we will reproduce the results of [Deep Speech: Scaling up end-to-end speech recognition](http://arxiv.org/abs/1412.5567). The core of the system is a bidirectional recurrent neural network (BRNN) trained to ingest speech spectrograms and generate English text transcriptions.

Let a single utterance $x$ and label $y$ be sampled from a training set $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$. Each utterance $x^{(i)}$ is a time series of length $T^{(i)}$ in which every time-slice is a vector of audio features, $x^{(i)}_t$ with $t=1,\ldots,T^{(i)}$. We use MFCC as our features, so $x^{(i)}_{t,p}$ denotes the $p$-th MFCC feature in the audio frame at time $t$. The goal of our BRNN is to convert an input sequence $x$ into a sequence of character probabilities for the transcription $y$, with $\hat{y}_t =\mathbb{P}(c_t \mid x)$, where $c_t \in \{a,b,c, \ldots , z, space, apostrophe, blank\}$. (The significance of $blank$ will be explained below.)

Our BRNN model is composed of $5$ layers of hidden units. For an input $x$, the hidden units at layer $l$ are denoted $h^{(l)}$ with the convention that $h^{(0)}$ is the input. The first three layers are not recurrent. For the first layer, at each time $t$, the output depends on the MFCC frame $x_t$ along with a context of $C$ frames on each side. (We typically use $C \in \{5, 7, 9\}$ for our experiments.) The remaining non-recurrent layers operate on independent data for each time step. Thus, for each time $t$, the first $3$ layers are computed by:

$$h^{(l)}_t = g(W^{(l)} h^{(l-1)}_t + b^{(l)})$$

where $g(z) = \min\{\max\{0, z\}, 20\}$ is a clipped rectified-linear (ReLU) activation function and $W^{(l)}$, $b^{(l)}$ are the weight matrix and bias parameters for layer $l$. The fourth layer is a bidirectional recurrent layer[[1](http://www.di.ufpe.br/~fnj/RNA/bibliografia/BRNN.pdf)]. This layer includes two sets of hidden units: a set with forward recurrence, $h^{(f)}$, and a set with backward recurrence $h^{(b)}$:

$$h^{(f)}_t = g(W^{(4)} h^{(3)}_t + W^{(f)}_r h^{(f)}_{t-1} + b^{(4)})$$
$$h^{(b)}_t = g(W^{(4)} h^{(3)}_t + W^{(b)}_r h^{(b)}_{t+1} + b^{(4)})$$

Note that $h^{(f)}$ must be computed sequentially from $t = 1$ to $t = T^{(i)}$ for the $i$-th utterance, while
the units $h^{(b)}$ must be computed sequentially in reverse from $t = T^{(i)}$ to $t = 1$.
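
To make the order of computation concrete, here is a minimal sketch of the two recurrences in plain Python. It is a simplification under stated assumptions: the hidden states are scalars rather than vectors, the weights `w_in`, `w_rec` and `b` are made-up illustrative values, and the initial state is assumed to be zero.

```python
def clipped_relu(z, clip=20.0):
    """g(z) = min(max(0, z), clip)."""
    return min(max(0.0, z), clip)


def forward_states(inputs, w_in, w_rec, b):
    """h_t depends on h_{t-1}, so iterate from t = 1 up to t = T."""
    states, prev = [], 0.0  # assumed zero initial state
    for x_t in inputs:
        prev = clipped_relu(w_in * x_t + w_rec * prev + b)
        states.append(prev)
    return states


def backward_states(inputs, w_in, w_rec, b):
    """h_t depends on h_{t+1}, so iterate from t = T down to t = 1."""
    states, nxt = [0.0] * len(inputs), 0.0  # assumed zero initial state
    for t in reversed(range(len(inputs))):
        nxt = clipped_relu(w_in * inputs[t] + w_rec * nxt + b)
        states[t] = nxt
    return states


h3 = [1.0, 2.0, 3.0]  # toy layer-3 activations for T = 3
h_f = forward_states(h3, w_in=1.0, w_rec=0.5, b=0.0)
h_b = backward_states(h3, w_in=1.0, w_rec=0.5, b=0.0)
print(h_f)  # [1.0, 2.5, 4.25]
print(h_b)  # [2.75, 3.5, 3.0]
```

The forward states accumulate information from earlier frames and the backward states from later frames, which is why the two passes cannot share a loop direction.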

The fifth (non-recurrent) layer takes both the forward and backward units as inputs

$$h^{(5)} = g(W^{(5)} h^{(4)} + b^{(5)})$$

where $h^{(4)} = h^{(f)} + h^{(b)}$. The output layer consists of standard logits that correspond to the predicted character probabilities for each time slice $t$ and character $k$ in the alphabet:

$$h^{(6)}_{t,k} = \hat{y}_{t,k} = (W^{(6)} h^{(5)}_t)_k + b^{(6)}_k$$

Here $b^{(6)}_k$ denotes the $k$-th bias and $(W^{(6)} h^{(5)}_t)_k$ the $k$-th element of the matrix product.

Once we have computed a prediction for $\hat{y}_{t,k}$, we compute the CTC loss[[2]](http://www.cs.toronto.edu/~graves/preprint.pdf) $\mathcal{L}(\hat{y}, y)$ to measure the error in prediction. During training, we can evaluate the gradient $\nabla \mathcal{L}(\hat{y}, y)$ with respect to the network outputs given the ground-truth character sequence $y$. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use the Adam method for training[[3](http://arxiv.org/abs/1412.6980)].
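
The role of $blank$ is easiest to see in how CTC maps a per-frame label sequence to a transcription: consecutive repeats are merged first, and blanks are then removed. The following sketch illustrates this mapping; the function name `ctc_collapse` and the use of `"_"` as the blank symbol are illustrative choices, not part of the notebook.

```python
def ctc_collapse(frame_labels, blank):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


BLANK = "_"  # illustrative blank symbol
with_blanks = "".join(ctc_collapse(list("hh_e_ll_ll_oo"), BLANK))
without_blanks = "".join(ctc_collapse(list("hello"), BLANK))
print(with_blanks)     # hello
print(without_blanks)  # helo
```

Without a blank between them, the two `l` frames collapse into one, so the blank is what allows a transcription to contain doubled characters.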

The complete BRNN model is illustrated in the figure below.

![DeepSpeech BRNN](images/rnn_fig-624x548.png)



# Preliminaries

## Imports

Here we first import all of the packages we require to implement the DeepSpeech BRNN.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import ctc_ops

## Global Constants

Next we introduce several constants used in the algorithm below.  In particular, we define
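* `learning_rate` - The learning rate we will employ in the Adam optimizer[[3]](http://arxiv.org/abs/1412.6980)
* `beta1`, `beta2` - The exponential decay rates Adam uses for its first- and second-moment estimates
* `epsilon` - A small constant Adam uses for numerical stability
* `training_iters` - The number of passes over the training set (epochs) we will train for
* `batch_size` - The number of elements in a batch
* `display_step` - The number of epochs we cycle through before displaying progress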

In [2]:
learning_rate = 0.001   # TODO: Determine a reasonable value for this
beta1 = 0.9             # TODO: Determine a reasonable value for this
beta2 = 0.999           # TODO: Determine a reasonable value for this
epsilon = 1e-8          # TODO: Determine a reasonable value for this
training_iters = 5000   # TODO: Determine a reasonable value for this
batch_size = 1          # TODO: Determine a reasonable value for this
display_step = 1        # TODO: Determine a reasonable value for this

Note that we use the Adam optimizer[[3]](http://arxiv.org/abs/1412.6980) instead of Nesterov’s Accelerated Gradient [[4]](http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf) used in the original DeepSpeech paper, as, at the time of writing, TensorFlow does not have an implementation of Nesterov’s Accelerated Gradient [[4]](http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf).
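
For intuition on how `learning_rate`, `beta1`, `beta2` and `epsilon` interact, here is a minimal sketch of a single Adam update for one scalar parameter, following the formulation in the Adam paper[[3]](http://arxiv.org/abs/1412.6980). The function `adam_step` is a hypothetical helper for illustration; TensorFlow's own implementation may differ in detail.

```python
import math


def adam_step(theta, grad, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Return the updated (theta, m, v) after step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v


# Toy problem: minimise f(theta) = theta**2, whose gradient is 2*theta
theta, m, v = 1.0, 0.0, 0.0
theta_after_one_step, _, _ = adam_step(theta, 2 * theta, m, v, t=1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta_after_one_step, theta)
```

Because the step is normalised by the second-moment estimate, the first update moves the parameter by roughly `learning_rate` regardless of the gradient's scale.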

As we will also employ dropout on the feedforward layers of the network, we need to define a parameter `dropout_rate` that keeps track of the dropout rate for these layers

In [3]:
dropout_rate = 0.05  # TODO: Validate this is a reasonable value

One more constant required of the non-recurrent layers is the clipping value of the ReLU. We capture that in the value of the variable `relu_clip`

In [4]:
relu_clip = 20 # TODO: Validate this is a reasonable value

## Geometric Constants

Now we will introduce several constants related to the geometry of the network.

The network views each speech sample as a sequence of time-slices $x^{(i)}_t$ of length $T^{(i)}$. As the speech samples vary in length, we know that $T^{(i)}$ need not equal $T^{(j)}$ for $i \ne j$. However, BRNNs in TensorFlow are unable to deal with sequences of differing lengths. Thus, we must pad speech sample sequences with trailing zeros such that they are all of the same length. This common padded length is captured in the variable `n_steps`, which will be set after the data set is loaded.
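
The padding step can be sketched in plain Python as follows. The helper `pad_batch` is hypothetical and only illustrates the idea; the actual importer used later in this notebook may do this differently. Besides the padded batch, it returns the original lengths, which are the kind of values the `seq_len` placeholder holds.

```python
def pad_batch(sequences, n_input):
    """Pad each sequence of feature vectors with trailing zero vectors."""
    lengths = [len(seq) for seq in sequences]
    n_steps = max(lengths)
    padded = [
        list(seq) + [[0.0] * n_input for _ in range(n_steps - len(seq))]
        for seq in sequences
    ]
    return padded, lengths, n_steps


# Two toy utterances with 2 features per frame: 3 frames and 1 frame
batch = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0]],
]
padded, lengths, n_steps_example = pad_batch(batch, n_input=2)
print(n_steps_example, lengths)
print(padded[1])
```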

Each of the `n_steps` vectors holds the MFCC features of one time-slice of the speech sample. We will make the number of MFCC features dependent upon the sample rate of the data set. Generically, if the sample rate is 8kHz we use 13 features. If the sample rate is 16kHz we use 26 features... We capture the dimension of these vectors, equivalently the number of MFCC features, in the variable `n_input`

In [5]:
n_input = 26 # TODO: Determine this programmatically from the sample rate
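
One possible way to address the TODO is a small lookup keyed on the sample rate. This is only a sketch: the function name `n_input_for_sample_rate` is hypothetical, and it covers just the two sample rates named above, raising an error for anything else rather than guessing.

```python
def n_input_for_sample_rate(sample_rate_hz):
    """Return the number of MFCC features for a known sample rate."""
    features_by_rate = {8000: 13, 16000: 26}  # the two cases stated above
    if sample_rate_hz not in features_by_rate:
        raise ValueError("No MFCC feature count defined for %d Hz" % sample_rate_hz)
    return features_by_rate[sample_rate_hz]


print(n_input_for_sample_rate(16000))  # 26
```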

As previously mentioned, the BRNN is not simply fed the MFCC features of a given time-slice. It is fed, in addition, a context of $C \in \{5, 7, 9\}$ frames on either side of the frame in question. The number of frames in this context is captured in the variable `n_context`

In [6]:
n_context = 5 # TODO: Determine the optimal value using a validation data set
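
To show where the dimension `n_input + 2*n_input*n_context` comes from, here is a minimal sketch of stacking each frame with its context. It rests on two assumptions: missing frames at the edges of the utterance are filled with zeros, and the vector is ordered as past frames, current frame, future frames. The importer used later may differ on both points, and `add_context` is a hypothetical name.

```python
def add_context(frames, n_context):
    """Concatenate each frame with n_context frames on either side."""
    n_input = len(frames[0])
    zero = [0.0] * n_input
    # Assumed zero padding so edge frames also get a full window
    padded = [zero] * n_context + [list(f) for f in frames] + [zero] * n_context
    windows = []
    for t in range(len(frames)):
        window = []
        for frame in padded[t:t + 2 * n_context + 1]:
            window.extend(frame)
        windows.append(window)
    return windows


toy = add_context([[1.0], [2.0], [3.0]], n_context=1)
print(toy)  # [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 0.0]]

# With the notebook's values: 26 features and 5 context frames per side
full = add_context([[0.0] * 26 for _ in range(4)], n_context=5)
print(len(full[0]))  # 286 == 26 + 2*26*5
```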

Next we will introduce constants that specify the geometry of some of the non-recurrent layers of the network. We do this by simply specifying the number of units in each of the layers

In [7]:
n_hidden_1 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper
n_hidden_2 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper
n_hidden_5 = n_input + 2*n_input*n_context # Note: This value was not specified in the original paper

where `n_hidden_1` is the number of units in the first layer, `n_hidden_2` the number of units in the second, and  `n_hidden_5` the number in the fifth. We haven't forgotten about the third or sixth layer. We will define their unit count below.

An LSTM BRNN consists of a pair of LSTM RNNs: one LSTM RNN that works "forward in time"

<img src="images/LSTM3-chain.png" alt="LSTM" width="800">

and a second LSTM RNN that works "backwards in time"

<img src="images/LSTM3-chain.png" alt="LSTM" width="800">

The dimension of the cell state, the upper line connecting subsequent LSTM units, is independent of the input dimension and the same for both the forward and backward LSTM RNN.

Hence, we are free to choose the dimension of this cell state independent of the input dimension. We capture the cell state dimension in the variable `n_cell_dim`.

In [8]:
n_cell_dim = n_input + 2*n_input*n_context # TODO: Is this a reasonable value

The number of units in the third layer, which feeds into the LSTM, is determined by `n_cell_dim` as follows

In [9]:
n_hidden_3 = 2 * n_cell_dim

Next, we introduce an additional variable `n_character` which holds the number of characters in the target language plus one, for the $blank$. For English it is the cardinality of the set $\{a,b,c, . . . , z, space, apostrophe, blank\}$ we referred to earlier.

In [10]:
n_character = 29 # TODO: Determine if this should be extended with other punctuation
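
As an illustration of how the 29 classes might be indexed, the sketch below maps text to integer labels. The choices space = 0 and a..z = 1..26 agree with the decoding code at the end of this notebook; placing the apostrophe at 27 and the blank at 28 is an assumption made here (the blank is put last because TensorFlow's CTC ops reserve the last class index for it). The helper `text_to_labels` is hypothetical.

```python
import string

SPACE_LABEL = 0
APOSTROPHE_LABEL = 27  # assumed index
BLANK_LABEL = 28       # assumed index: last class, never appears in a transcript


def text_to_labels(text):
    """Map a transcript to a list of integer class labels."""
    labels = []
    for ch in text.lower():
        if ch == " ":
            labels.append(SPACE_LABEL)
        elif ch == "'":
            labels.append(APOSTROPHE_LABEL)
        elif ch in string.ascii_lowercase:
            labels.append(ord(ch) - ord("a") + 1)
        else:
            raise ValueError("Character %r is not in the alphabet" % ch)
    return labels


example_labels = text_to_labels("it's a")
print(example_labels)  # [9, 20, 27, 19, 0, 1]
```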

The number of units in the sixth layer is determined by `n_character` as follows 

In [11]:
n_hidden_6 = n_character

# Data Import

Next we will import the [TED-LIUM](http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus) data

In [12]:
from util.importers.ted_lium import read_data_sets
ted_lium = read_data_sets('./data/smoke_test', n_input, n_context)

Now that we have loaded the data we can set the `n_steps` parameter

In [13]:
n_steps = ted_lium.train.max_batch_seq_len

# Graph Creation

Next we concern ourselves with graph creation.

First we create several placeholders in our graph. The first two, `x` and `y`, are placeholders for our training data pairs.

In [14]:
x = tf.placeholder("float", [None, n_steps, n_input + 2*n_input*n_context])
y = tf.sparse_placeholder(tf.int32)

The placeholder `y` represents the text transcript of each element in a batch. `y` is of type "SparseTensor", as required by the CTC algorithm. The details of how the text transcripts are encoded into a "SparseTensor" will be presented below.
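
As a rough sketch of the encoding, a batch of label sequences of differing lengths can be turned into the `(indices, values, shape)` triple that describes a sparse tensor. The helper name `sparse_tuple_from` is hypothetical, and the importer used in this notebook may build its batches differently.

```python
def sparse_tuple_from(label_sequences):
    """Convert a batch of label lists to (indices, values, shape)."""
    indices, values = [], []
    for b, seq in enumerate(label_sequences):
        for t, label in enumerate(seq):
            indices.append([b, t])  # position: (batch element, character index)
            values.append(label)    # class label stored at that position
    shape = [len(label_sequences), max(len(seq) for seq in label_sequences)]
    return indices, values, shape


indices, values, shape = sparse_tuple_from([[8, 9], [1]])
print(indices)  # [[0, 0], [0, 1], [1, 0]]
print(values)   # [8, 9, 1]
print(shape)    # [2, 2]
```

Only the positions that actually hold a character are stored, so shorter transcripts need no padding value.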

The placeholder `x` holds the speech features along with their prefix and postfix contexts for each element in a batch. As it represents MFCC features, its type is "float". The `None` dimension of its shape

```python
[None, n_steps, n_input + 2*n_input*n_context]
```

is a 'placeholder' for the batch size. The `n_steps` dimension of its shape indicates the number of time-slices in the sequence. Finally, the `n_input + 2*n_input*n_context` dimension of its shape indicates the number of MFCC features `n_input` along with the number of MFCC features in the prefix-context `n_input*n_context` and postfix-context `n_input*n_context`.

The next placeholder is for the sequence lengths of the elements in each batch

In [15]:
seq_len = tf.placeholder(tf.int32, [None])

The `None` dimension of the placeholder `seq_len`, as in the case of the placeholder `x`, is a 'placeholder' for the batch size. So, `seq_len` is a placeholder for a vector of 32-bit integers. Each one of these 32-bit integers holds the length of the corresponding element in the batch.

As we will be employing dropout on the feedforward layers of the network, we will also introduce a placeholder `keep_prob`, which holds the probability of keeping a unit (that is, one minus the dropout rate) for the feedforward layers

In [16]:
keep_prob = tf.placeholder(tf.float32)
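
For intuition, the effect of dropout with a given `keep_prob` can be sketched in plain Python: each value is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise it is set to zero, which is how `tf.nn.dropout` behaves. The helper below is a hypothetical stand-in, not the TensorFlow implementation.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, scaling kept values."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded so the sketch is repeatable
dropped = dropout([1.0] * 1000, keep_prob=0.95, rng=rng)
print(sum(1 for d in dropped if d == 0.0), "of 1000 values were zeroed")

# With keep_prob = 1.0 nothing is dropped and nothing is rescaled
unchanged = dropout([1.0, 2.0], keep_prob=1.0, rng=rng)
```

The scaling keeps the expected value of each unit the same with and without dropout, so feeding `keep_prob = 1.0` at evaluation time requires no further adjustment.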

We will define the learned variables through two dictionaries. The first dictionary `weights` holds the learned weight variables. The second `biases` holds the learned bias variables.

The `weights` dictionary has the keys `'h1'`, `'h2'`, `'h3'`, `'h5'`, and `'h6'` each keyed against the values of the corresponding weight matrix. In particular, the first key `'h1'` is keyed against a value which is the learned weight matrix that converts an input vector of dimension `n_input + 2*n_input*n_context`  to a vector of dimension `n_hidden_1`. Similarly, the second key `'h2'` is keyed against a value which is the weight matrix converting an input vector of dimension `n_hidden_1` to one of dimension `n_hidden_2`. The keys `'h3'`, `'h5'`, and `'h6'` are similar. Likewise, the `biases` dictionary has biases for the various layers.

Concretely these dictionaries are given by

In [17]:
# Store layers weight & bias
# TODO: Is random_normal the best distribution to draw from?
weights = {
    'h1': tf.Variable(tf.random_normal([n_input + 2*n_input*n_context, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'h3': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_3])),
    'h5': tf.Variable(tf.random_normal([(2 * n_cell_dim), n_hidden_5])),
    'h6': tf.Variable(tf.random_normal([n_hidden_5, n_hidden_6]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'b3': tf.Variable(tf.random_normal([n_hidden_3])),
    'b5': tf.Variable(tf.random_normal([n_hidden_5])),
    'b6': tf.Variable(tf.random_normal([n_hidden_6]))
}
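
As a quick sanity check on these shapes, the following sketch traces the feature dimension through the layers using the constants defined above. There is no `'h4'` entry because the LSTM cells hold their own weights; the sketch assumes, as the comments in `BiRNN` below indicate, that the forward and backward LSTM outputs are concatenated into `2*n_cell_dim` values.

```python
# Constants as defined earlier in the notebook
n_input, n_context, n_character = 26, 5, 29
n_in = n_input + 2 * n_input * n_context
n_hidden_1 = n_hidden_2 = n_hidden_5 = n_cell_dim = n_in
n_hidden_3 = 2 * n_cell_dim
n_hidden_6 = n_character

weight_shapes = [
    ("h1", (n_in, n_hidden_1)),
    ("h2", (n_hidden_1, n_hidden_2)),
    ("h3", (n_hidden_2, n_hidden_3)),
    ("h5", (2 * n_cell_dim, n_hidden_5)),
    ("h6", (n_hidden_5, n_hidden_6)),
]

dim = n_in
trace = [dim]
for name, (rows, cols) in weight_shapes:
    if name == "h5":
        # Layer 4: bidirectional LSTM, forward and backward outputs concatenated
        dim = 2 * n_cell_dim
        trace.append(dim)
    assert rows == dim, "layer %s expects %d inputs, got %d" % (name, rows, dim)
    dim = cols
    trace.append(dim)

print(trace)  # [286, 286, 286, 572, 572, 286, 29]
```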

Next we introduce a utility function `BiRNN` that can take the placeholder `x` along with the dictionaries `weights` and `biases` and add all the apropos operators to our default graph.

In [18]:
def BiRNN(_X, _weights, _biases):
    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]
    _X = tf.transpose(_X, [1, 0, 2])  # Permute n_steps and batch_size
    # Reshape to prepare input for first layer
    _X = tf.reshape(_X, [-1, n_input + 2*n_input*n_context]) # (n_steps*batch_size, n_input + 2*n_input*n_context)
    
    #Hidden layer with clipped RELU activation and dropout
    layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(_X, _weights['h1']), _biases['b1'])), relu_clip)
    layer_1 = tf.nn.dropout(layer_1, keep_prob)
    #Hidden layer with clipped RELU activation and dropout
    layer_2 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_1, _weights['h2']), _biases['b2'])), relu_clip)
    layer_2 = tf.nn.dropout(layer_2, keep_prob)
    #Hidden layer with clipped RELU activation and dropout
    layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, _weights['h3']), _biases['b3'])), relu_clip)
    layer_3 = tf.nn.dropout(layer_3, keep_prob)
    
    # Define lstm cells with tensorflow
    # Forward direction cell
    lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    # Backward direction cell
    lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    
    # Split data because rnn cell needs a list of inputs for the BRNN inner loop
    layer_3 = tf.split(0, n_steps, layer_3)
    
    # Get lstm cell output
    outputs, output_state_fw, output_state_bw = tf.nn.bidirectional_rnn(cell_fw=lstm_fw_cell,
                                                                        cell_bw=lstm_bw_cell,
                                                                        inputs=layer_3,
                                                                        dtype=tf.float32)
    
    # Reshape outputs from a list of n_steps tensors each of shape [batch_size, 2*n_cell_dim]
    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]
    outputs = tf.pack(outputs)
    outputs = tf.reshape(outputs, [-1, 2*n_cell_dim])
    
    #Hidden layer with clipped RELU activation and dropout
    layer_5 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(outputs, _weights['h5']), _biases['b5'])), relu_clip)
    layer_5 = tf.nn.dropout(layer_5, keep_prob)
    #Hidden layer of logits
    layer_6 = tf.add(tf.matmul(layer_5, _weights['h6']), _biases['b6'])
    
    # Reshape layer_6 from a tensor of shape [n_steps*batch_size, n_hidden_6]
    # to a tensor of shape [batch_size, n_steps, n_hidden_6]
    layer_6 = tf.reshape(layer_6, [n_steps, batch_size, n_hidden_6])
    layer_6 = tf.transpose(layer_6, [1, 0, 2])  # Permute n_steps and batch_size
    
    # Return layer_6
    return layer_6

The first few lines of the function `BiRNN`
```python
def BiRNN(_X, _weights, _biases):
    # Input shape: [batch_size, n_steps, n_input + 2*n_input*n_context]
    _X = tf.transpose(_X, [1, 0, 2])  # Permute n_steps and batch_size
    # Reshape to prepare input for first layer
    _X = tf.reshape(_X, [-1, n_input + 2*n_input*n_context])
    ...
```
reshape `_X` which has shape `[batch_size, n_steps, n_input + 2*n_input*n_context]` initially, to a tensor with shape `[n_steps*batch_size, n_input + 2*n_input*n_context]`. This is done to prepare the batch for input into the first layer which expects a tensor of rank `2`.
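
The effect of the transpose followed by the reshape can be sketched with nested Python lists. In the resulting rank-2 layout the rows are time-major: the row for time step `t` and batch element `b` sits at index `t*batch_size + b`. The helper `to_time_major_rows` is a hypothetical illustration, not TensorFlow code.

```python
def to_time_major_rows(batch):
    """[batch_size][n_steps][dim] -> list of n_steps*batch_size rows."""
    batch_size, n_steps = len(batch), len(batch[0])
    return [batch[b][t] for t in range(n_steps) for b in range(batch_size)]


# Toy batch: 2 utterances, 3 time steps, 1 feature that encodes (b, t)
toy_batch = [[["b%dt%d" % (b, t)] for t in range(3)] for b in range(2)]
rows = to_time_major_rows(toy_batch)
print(rows)
```

Keeping all rows of one time step adjacent is what later allows the tensor to be cut into `n_steps` equal pieces, one per time step.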

The next few lines of  `BiRNN`
```python
    #Hidden layer with clipped RELU activation and dropout
    layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(_X, _weights['h1']), _biases['b1'])), relu_clip)
    layer_1 = tf.nn.dropout(layer_1, keep_prob)
    ...
```
pass `_X` through the first layer of the non-recurrent neural network, then apply dropout to the result.

The next few lines do the same thing, but for the second and third layers
```python
    #Hidden layer with clipped RELU activation and dropout
    layer_2 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_1, _weights['h2']), _biases['b2'])), relu_clip)
    layer_2 = tf.nn.dropout(layer_2, keep_prob)
    #Hidden layer with clipped RELU activation and dropout
    layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, _weights['h3']), _biases['b3'])), relu_clip)
    layer_3 = tf.nn.dropout(layer_3, keep_prob)
```

Next we create the forward and backward LSTM units
```python
    # Define lstm cells with tensorflow
    # Forward direction cell
    lstm_fw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
    # Backward direction cell
    lstm_bw_cell = tf.nn.rnn_cell.BasicLSTMCell(n_cell_dim, forget_bias=1.0)
```
both of which have `n_cell_dim` units and a bias of `1.0` for the forget gate of the LSTM.

The next line of the function `BiRNN` does a bit more data preparation.
```python
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    layer_3 = tf.split(0, n_steps, layer_3)
```
It splits `layer_3` into `n_steps` tensors along dimension `0`, as the LSTM BRNN expects its input to be a list of `n_steps` tensors, each of shape `[batch_size, 2*n_cell_dim]`.
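
In plain Python terms, the split amounts to cutting the time-major list of rows into `n_steps` consecutive chunks of `batch_size` rows each. The helper `split_rows` below is a hypothetical stand-in for the TensorFlow call.

```python
def split_rows(rows, n_steps):
    """Cut n_steps*batch_size rows into n_steps chunks of batch_size rows."""
    batch_size = len(rows) // n_steps
    return [rows[t * batch_size:(t + 1) * batch_size] for t in range(n_steps)]


# 3 time steps, batch of 2: rows are ordered t0b0, t0b1, t1b0, t1b1, ...
flat = ["t0b0", "t0b1", "t1b0", "t1b1", "t2b0", "t2b1"]
chunks = split_rows(flat, n_steps=3)
print(chunks)  # [['t0b0', 't0b1'], ['t1b0', 't1b1'], ['t2b0', 't2b1']]
```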

The next line of `BiRNN`
```python
    # Get lstm cell output
    outputs, output_state_fw, output_state_bw  = tf.nn.bidirectional_rnn(cell_fw=lstm_fw_cell,
                                                                         cell_bw=lstm_bw_cell,
                                                                         inputs=layer_3,
                                                                         dtype=tf.float32)
```
feeds `layer_3` to the LSTM BRNN cell and obtains the LSTM BRNN output.

The next lines convert `outputs` from a list of rank two tensors into a single rank two tensor in preparation for passing it to the next neural network layer  
```python
    # Reshape outputs from a list of n_steps tensors each of shape [batch_size, 2*n_cell_dim]
    # to a single tensor of shape [n_steps*batch_size, 2*n_cell_dim]
    outputs = tf.pack(outputs)
    outputs = tf.reshape(outputs, [-1, 2*n_cell_dim])
```

The next couple of lines feed `outputs` to the fifth hidden layer
```python
    #Hidden layer with clipped RELU activation and dropout
    layer_5 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(outputs, _weights['h5']), _biases['b5'])), relu_clip)
    layer_5 = tf.nn.dropout(layer_5, keep_prob)
```

The next line of `BiRNN`
```python
    #Hidden layer of logits
    layer_6 = tf.add(tf.matmul(layer_5, _weights['h6']), _biases['b6'])
```
applies the weight matrix `_weights['h6']` and bias `_biases['b6']` to the output of `layer_5`, creating `n_hidden_6` dimensional vectors, the logits.

The next lines of `BiRNN`
```python
    # Reshape layer_6 from a tensor of shape [n_steps*batch_size, n_hidden_6]
    # to a tensor of shape [batch_size, n_steps, n_hidden_6]
    layer_6 = tf.reshape(layer_6, [n_steps, batch_size, n_hidden_6])
    layer_6 = tf.transpose(layer_6, [1, 0, 2])  # Permute n_steps and batch_size
```
reshape `layer_6` to the slightly more useful shape `[batch_size, n_steps, n_hidden_6]`.

The final line of `BiRNN` returns `layer_6`
```python
    # Return layer_6
    return layer_6
```

Next we actually call `BiRNN` with the apropos data

In [19]:
layer_6 = BiRNN(x, weights, biases)

# Loss Function

In accord with [Deep Speech: Scaling up end-to-end speech recognition](http://arxiv.org/abs/1412.5567), the loss function used by our network should be the CTC loss function[[2]](http://www.cs.toronto.edu/~graves/preprint.pdf). Conveniently, this loss function is implemented in TensorFlow. Thus, we can simply make use of this implementation to define our loss.

In [20]:
# CTC loss requires layer_6 be time major
layer_6 = tf.transpose(layer_6, [1, 0, 2])

# Compute the CTC loss
total_loss = ctc_ops.ctc_loss(layer_6, y, seq_len)
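
To give a feel for what the CTC loss measures, here is a minimal sketch of the forward recursion from the CTC paper[[2]](http://www.cs.toronto.edu/~graves/preprint.pdf) for a single utterance. It sums the probability of every frame-level alignment that collapses to the label sequence and returns the negative log of that sum. This is not TensorFlow's implementation: the function names are hypothetical, and the sketch works directly with probabilities for readability, so it would underflow on utterances of realistic length.

```python
import math


def softmax(logits):
    """Turn one frame's logits into probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_neg_log_likelihood(logits, labels, blank):
    """-log p(labels | logits); logits is a list of T rows of class scores."""
    probs = [softmax(row) for row in logits]
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(probs)
    # alpha[s]: total probability of all prefixes ending in ext[s] at time t
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, two classes (label 0 and blank 1), uniform scores, target [0]
loss = ctc_neg_log_likelihood([[0.0, 0.0], [0.0, 0.0]], labels=[0], blank=1)
print(loss)
```

In this toy case three of the four equally likely alignments (label-label, label-blank, blank-label) collapse to the target, so the likelihood is 0.75.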

Now, instead of using the total loss for the entire batch we want to calculate the average loss across the batch to facilitate comparing results as the batch size varies. So we calculate the following 

In [21]:
avg_loss = tf.reduce_mean(total_loss)

# Optimizer

In contrast to [Deep Speech: Scaling up end-to-end speech recognition](http://arxiv.org/abs/1412.5567), in which [Nesterov’s Accelerated Gradient Descent](http://www.cs.toronto.edu/~fritz/absps/momentum.pdf) was used, we will use the Adam method for optimization[[3](http://arxiv.org/abs/1412.6980)], because, generally, it requires less fine-tuning.

In [22]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   beta1=beta1,
                                   beta2=beta2,
                                   epsilon=epsilon).minimize(avg_loss)

# Decoder

Next, to monitor training progress, we will introduce an operator that decodes the network output

In [23]:
decoded, _ = ctc_ops.ctc_beam_search_decoder(layer_6, seq_len)

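Using this decoding operator we can then calculate the character error rate (CER) of the system, measured as the edit distance between the decoded text and the reference transcript. Note that, although the variable below is named `acc`, it is an error measure: lower is better.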

In [24]:
acc = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), y))
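
The quantity being averaged can be sketched in plain Python as the Levenshtein distance between hypothesis and truth, divided by the length of the truth, which is the normalisation `tf.edit_distance` applies by default. The helper names are hypothetical, and the sketch assumes a non-empty reference transcript.

```python
def edit_distance(hyp, truth):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(truth) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            cost = 0 if h == t else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert t
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(hyp, truth):
    """Edit distance normalised by the length of the reference."""
    return edit_distance(hyp, truth) / float(len(truth))


print(character_error_rate("helo", "hello"))   # 0.2
print(character_error_rate("aaaaaaaa", "ab"))  # 3.5
```

Because the distance is divided by the reference length only, a hypothesis much longer than the reference yields a value above 1, which is one way such values can arise early in training.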

# Training

Now we will begin the process of training the network

In [25]:
with tf.Session() as session:
    # Initialize all variables
    tf.initialize_all_variables().run()
    
    # Loop over the data set for training_iters epochs
    for epoch in range(training_iters):
        # Define total_loss
        total_loss = 0
        
        # Define character error rate
        train_cer = 0
        
        # Determine the total number of batches
        total_batch = int(ted_lium.train.num_examples/batch_size)
        
        # Loop over the batches
        for batch in range(total_batch):
            # Obtain the next batch of data
            batch_x, batch_y, batch_seq_len = ted_lium.train.next_batch(batch_size)
            
            # Create a map to fill the placeholders with batch data
            feed = {x: batch_x,
                    y: batch_y,
                    seq_len: batch_seq_len,
                    keep_prob: (1 - dropout_rate)}
            
            # Train on the current batch
            batch_avg_loss, _ = session.run([avg_loss, optimizer], feed)
            train_cer += session.run(acc, feed_dict=feed)
            
            # Add batch_avg_loss to total_loss
            total_loss += batch_avg_loss
            
        if epoch % display_step == 0:
            print("Epoch: %04d avg_cer= %.9f" % (epoch + 1, train_cer / total_batch))

    # Indicate optimization has concluded
    print("Optimization Finished!")
    
    # Decoding
    d = session.run(decoded[0], feed_dict=feed)
    str_decoded = ''.join([chr(xt) for xt in np.asarray(d[1]) + (ord('a') - 1 )])
    # Remove the character that the blank label maps to
    str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
    # Replace the character that the space label maps to with a space
    str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')
    print('Decoded:\n%s' % str_decoded)

Epoch: 0001 avg_cer= 3.557692289
Epoch: 0002 avg_cer= 3.076923132
Epoch: 0003 avg_cer= 3.307692289
Epoch: 0004 avg_cer= 3.365384579
Epoch: 0005 avg_cer= 3.673076868
Epoch: 0006 avg_cer= 4.211538315
Epoch: 0007 avg_cer= 4.153846264
Epoch: 0008 avg_cer= 4.211538315
Epoch: 0009 avg_cer= 4.096153736
Epoch: 0010 avg_cer= 4.307692528
Epoch: 0011 avg_cer= 4.423077106
Epoch: 0012 avg_cer= 4.153846264
Epoch: 0013 avg_cer= 4.134615421
Epoch: 0014 avg_cer= 3.961538553
Epoch: 0015 avg_cer= 4.057692528
Epoch: 0016 avg_cer= 4.096153736
Epoch: 0017 avg_cer= 4.096153736
Epoch: 0018 avg_cer= 4.115384579
Epoch: 0019 avg_cer= 4.134615421
Epoch: 0020 avg_cer= 4.153846264
Epoch: 0021 avg_cer= 4.076922894
Epoch: 0022 avg_cer= 4.307692528
Epoch: 0023 avg_cer= 4.134615421
Epoch: 0024 avg_cer= 4.115384579
Epoch: 0025 avg_cer= 4.173077106
Epoch: 0026 avg_cer= 4.442307472
Epoch: 0027 avg_cer= 4.346153736
Epoch: 0028 avg_cer= 4.153846264
Epoch: 0029 avg_cer= 3.750000000
Epoch: 0030 avg_cer= 3.923076868
Epoch: 003