<a href="https://colab.research.google.com/github/michaelgfalk/clean-ocr/blob/master/ocr_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Waves of Words: Correcting Trove's Messy OCR

One aim of the *Waves of Words* project is to extract Aboriginal wordlists from [Trove](https://trove.nla.gov.au). A challenge we face is that historical newspapers are difficult to OCR, so many of the texts are riddled with errors.

Using the training data from the ALTA 2017 OCR competition, can we create a model that cleans the text well enough for our Aboriginal word detector to work?

I have been considering whether uppercase letters and punctuation should be preserved in this model, given that the aim is to clean up the text for our detector, which only requires lowercase letters and ignores punctuation. I think we need to include all the characters. The extra information about sentence boundaries, for example, should help the model correct the text, as it would a human. Moreover, many OCR errors involve exchanging punctuation or digits for letters, e.g. `l = 1 = !`.

**References:**

* D. Mollá, S. Cassidy. Overview of the 2017 ALTA Shared Task:
Correcting OCR Errors (2017). *Proc. ALTA 2017*.
[https://aclanthology.coli.uni-saarland.de/papers/U17-1014/u17-1014](https://aclanthology.coli.uni-saarland.de/papers/U17-1014/u17-1014)

In [0]:
# Install TensorFlow2
!pip install -q tensorflow-gpu==2.0.0-alpha0


In [0]:
from __future__ import absolute_import, division, print_function

from google.colab import drive

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import csv
import numpy as np
import os
import time
from datetime import date
import pickle as p

from sklearn.model_selection import train_test_split

In [0]:
# Mount google drive to get training data. Set data_dir
drive.mount('/content/gdrive')
data_dir = '/content/gdrive/My Drive/waves_of_words/ocr_correction_data/'

Mounted at /content/gdrive


# 1. Data Pipeline (new)

This new data pipeline uses TensorFlow's `Dataset` object to manage extraction and transformation more efficiently.

In [0]:
def create_dataset(source, target, start = 'स', end = 'ए'):
  """Imports articles from csv and packages for training."""
  
  # Lists for import
  raw_x = []
  raw_y = []
  
  # Import raw text
  with open(source, "rt") as f:
    x_reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for row in x_reader:
      raw_x.append(row[1])
  with open(target, "rt") as f:
    y_reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for row in y_reader:
      raw_y.append(row[1])
  
  # Drop header rows
  raw_x = raw_x[1:]
  raw_y = raw_y[1:]
  
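  # Add special characters, dropping any pair in which either article already
  # contains one of them (filtering pairs jointly keeps x and y aligned)
  pairs = [(src, tgt) for src, tgt in zip(raw_x, raw_y)
           if not any(ch in text for ch in (start, end) for text in (src, tgt))]
  x = [start + src + end for src, _ in pairs]
  y = [start + tgt + end for _, tgt in pairs]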
  
  return x, y

In [0]:
def tokenize(source, target, tkzr = None):
  """Instantiates tokenizer if not passed, fits and applies."""
  
  if tkzr is None:
    # Fit tokenizer
    tkzr = tf.keras.preprocessing.text.Tokenizer(
      num_words = None,
      filters = None,
      lower = False,
      char_level = True
    )
    tkzr.fit_on_texts(source + target)
  
  # Apply to texts
  x = tkzr.texts_to_sequences(source)
  y = tkzr.texts_to_sequences(target)
  
  return x, y, tkzr
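
To make the character-level choice concrete, here is a minimal pure-Python sketch of a case-preserving, unfiltered character vocabulary. It is not the Keras `Tokenizer` implementation: `build_char_index` and `encode` are hypothetical helpers, and the alphabetical tie-breaking rule is an assumption made only for determinism. Index 0 is left unused so that it can serve as the padding value.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer, most frequent first; 0 is left for padding."""
    counts = Counter(ch for text in texts for ch in text)
    # Sort by descending frequency, breaking ties alphabetically (an assumption)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode(text, char_index):
    """Turn a string into a list of integer character ids."""
    return [char_index[ch] for ch in text]

# A toy "OCR output" and its "correction"
corpus = ["The l1ne!", "the line."]
char_index = build_char_index(corpus)
encoded = encode("The l1ne!", char_index)
```

Because nothing is lowercased or filtered, `T` and `t` receive different ids, and the confusable characters `l`, `1` and `!` remain distinguishable to the model.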

In [0]:
def split_and_stack(x, y, max_len, batch_size, drop = False):
  """Takes as input two Python lists, and outputs a list of buckets.
  
  Arguments:
  ==========
  x (list): the tokenized source strings
  y (list): the tokenized target strings
  max_len (int): the maximum number of time steps the model will consider
  batch_size (int): the batch size for the training examples
  drop (bool): if True, drop the final batch when len(x) is not a multiple
    of batch_size; if False, keep it as a smaller batch
  
  Returns:
  ==========
  bucket_list (list): a list of buckets, one per group of batch_size examples
  steps_per_epoch (int): the total number of batches across all buckets
  
  A bucket is a batched tf.data.Dataset of (x, y) pairs, each of shape
  (rows, max_len). Training examples are first sorted into groups of similar
  length and separated into batches of batch_size. Each batch is then padded out
  to an integer multiple of max_len, split into chunks of length max_len and stacked.
  
  This bucketing, splitting and stacking allows the data to be fed to a stateful RNN.
  
  Dimensions:
  -x and y are kept as separate arrays of identical shape
  -The first dimension separates individual rows (chunks of training examples).
    rows = batch_size * (padded_length / max_len); the code calls the number
    of chunks per example, padded_length / max_len, m_prime
  -The second dimension separates individual time-steps, and is fixed at max_len"""
  
  # Set number of training examples (round up if not dropping)
  assert len(x) == len(y)
  m = len(x)
  
  # Sort x and y by sequence length
  x_y = sorted(zip(x,y), key = lambda tup: max(len(tup[0]), len(tup[1])), reverse = True)
  
  # Loop through list and create batches
  bucket_list = []
  steps_per_epoch = 0
  for i in range(0, m, batch_size):
    
    # Slice and unpack the list
    bn = batch_size
    if len(x_y) < batch_size and drop:
      break
    elif len(x_y) < batch_size and not drop:
      bn = len(x_y)
    
    bx, by = zip(*[x_y.pop() for _ in range(bn)])
    
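    # Calculate length boundary and m_prime
    # pop() draws from the short end of the sorted list, so bx[0] and by[0] are
    # the shortest pair in the batch; take the maximum over the whole batch so
    # that pad_sequences never truncates a longer sequence
    bl = max(len(seq) for seq in bx + by) # Length of the longest x or y in the batch
    b = max_len - (bl % max_len) + bl # Round up to the next multiple of max_len
    m_prime = b // max_len # Number of max_len-sized chunks per sequence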
    
    steps_per_epoch += m_prime
    
    # Pad the sequences
    x_pad = pad_sequences(bx, maxlen = b, padding = 'post')
    y_pad = pad_sequences(by, maxlen = b, padding = 'post')
    
    # Flip x on the time dimension
    x_flipped = np.flip(x_pad, axis = 1)
    
    # Split and stack
    x_out = np.concatenate(np.split(x_pad, m_prime, axis = 1), axis = 0)
    y_out = np.concatenate(np.split(y_pad, m_prime, axis = 1), axis = 0)
    
    # Convert to dataset and append to list
    bucket = tf.data.Dataset.from_tensor_slices((x_out, y_out))
    bucket = bucket.batch(bn) # NB: bn = batch_size except on the last batch, if drop != True
    bucket_list.append(bucket)
    
  return bucket_list, steps_per_epoch
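
The pad, split and stack steps are easier to follow on a toy batch. The NumPy sketch below is an illustration only: the sequences and `max_len = 4` are made up, padding is done by hand rather than with `pad_sequences`, and plain ceiling rounding is used for brevity.

```python
import numpy as np

max_len = 4
# Toy batch of two tokenized sequences (0 is the padding index)
batch = [[1, 2, 3, 4, 5, 6], [7, 8, 9]]

longest = max(len(seq) for seq in batch)          # 6
padded_len = -(-longest // max_len) * max_len     # round 6 up to 8
m_prime = padded_len // max_len                   # 2 chunks per sequence

# Post-pad every sequence with zeros to padded_len
padded = np.zeros((len(batch), padded_len), dtype=int)
for row, seq in enumerate(batch):
    padded[row, :len(seq)] = seq

# Split along the time axis into m_prime chunks, then stack the chunks
# along the batch axis
stacked = np.concatenate(np.split(padded, m_prime, axis=1), axis=0)
print(stacked)
```

The result has shape `(4, 4)`. Read in batches of two rows, row `i` of the second batch continues row `i` of the first, which is the layout a stateful RNN expects.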

In [0]:
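def load_dataset(x_path, y_path, max_len = 20, batch_size = 128, drop = False, test_size = .2):
  """Reads the paired csv files, tokenizes them, splits them into train and
  test sets, and buckets each set with split_and_stack."""
  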
  x, y = create_dataset(x_path, y_path)
  
  x, y, tkzr = tokenize(x, y)
  
  seqs = (x_train, x_test, y_train, y_test) = train_test_split(x, y, test_size = test_size)
  
  train_buckets, train_steps = split_and_stack(x_train, y_train, max_len, batch_size, drop)
  
  test_buckets, val_steps = split_and_stack(x_test, y_test, max_len, batch_size, drop)
  
  return train_buckets, test_buckets, tkzr, seqs, train_steps, val_steps

# 2. Model Definition

Adapted from TensorFlow docs.

In [0]:
class StatefulGRU(tf.keras.layers.GRU):
  """GRU layer with all the necessaries."""
  def __init__(self, units):
    super(StatefulGRU, self).__init__(
        units = units,
        # The following parameters must be set this way
        # to use CuDNN on GPU
        activation='tanh',
        recurrent_activation='sigmoid',
        recurrent_dropout=0,
        unroll=False,
        use_bias=True,
        reset_after=True,
        # The following parameters are necessary for the
        # encoder-decoder architecture
        return_sequences=True, 
        return_state=True,
        # Stateful must be 'True' in order
        # to link the batches in each hyperbatch
        stateful=True,
        # Just the standard initializer
        recurrent_initializer='glorot_uniform'
    )
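
Why statefulness matters for chunked sequences can be shown with a toy recurrence. The sketch below is not a GRU: `run_rnn` is a hypothetical one-unit recurrence with arbitrary weights, used only to show that carrying the hidden state across chunks reproduces the result of processing the whole sequence at once.

```python
import math

def run_rnn(inputs, state=0.0):
    """Toy one-unit recurrence; returns the final hidden state."""
    for x in inputs:
        state = math.tanh(0.5 * state + 0.1 * x)
    return state

sequence = [3, 1, 4, 1, 5, 9, 2, 6]

# Process the whole sequence in one go
full = run_rnn(sequence)

# Process it in two chunks, carrying the state across (the "stateful" case)
carried = run_rnn(sequence[4:], state=run_rnn(sequence[:4]))

# Process the second chunk from a fresh state (the non-stateful case)
fresh = run_rnn(sequence[4:])

print(full == carried, full == fresh)
```

With the state carried over, the chunked run matches the full run exactly; starting the second chunk from zero gives a different final state.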

In [0]:
class Encoder(tf.keras.Model):
  def __init__(self, num_chars, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(num_chars, embedding_dim)
    self.first_gru = StatefulGRU(self.enc_units)
    self.second_gru = StatefulGRU(self.enc_units)

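  def call(self, x, hidden):
    x = self.embedding(x)
    # Each GRU returns (sequence_output, final_state); the first layer's
    # final state is used as the second layer's initial state
    x, state = self.first_gru(x, initial_state = hidden)
    output, state = self.second_gru(x, initial_state = state)
    return output, state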

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [0]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
  
  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying self.V to the tanh output
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    # softmax over axis 1 normalises the weights across the time steps
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights
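
The shapes in the attention layer can be checked with a small NumPy sketch. This is a stand-in under stated assumptions, not the Keras layer: the `Dense` layers are replaced by random weight matrices without biases, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time_steps, hidden, units = 2, 5, 4, 3

values = rng.normal(size=(batch, time_steps, hidden))  # encoder outputs
query = rng.normal(size=(batch, hidden))               # decoder hidden state

# Stand-ins for the Dense layers W1, W2 and V (biases omitted)
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
V = rng.normal(size=(units, 1))

# score: (batch, time_steps, 1); the query is broadcast over the time axis
score = np.tanh(values @ W1 + query[:, None, :] @ W2) @ V

# Softmax over the time axis (axis 1)
exp = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)

# context_vector: (batch, hidden), a weighted sum of the encoder outputs
context_vector = (attention_weights * values).sum(axis=1)

print(score.shape, attention_weights.shape, context_vector.shape)
```

For each example, the attention weights sum to one across the time steps, so the context vector is a weighted average of the encoder outputs.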

In [0]:
class Decoder(tf.keras.Model):
  def __init__(self, num_chars, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(num_chars, embedding_dim)
    self.first_gru = StatefulGRU(self.dec_units)
    self.second_gru = StatefulGRU(self.dec_units)
    self.fc = tf.keras.layers.Dense(num_chars)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

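    # passing the concatenated vector to the GRUs; the first layer's final
    # state is used as the second layer's initial state
    x, state = self.first_gru(x)
    output, state = self.second_gru(x, initial_state = state)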

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

# 3. Set up Training

In [0]:
optimizer = tf.keras.optimizers.Adam()
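# reduction='none' keeps one loss value per example, so that the padding mask
# in loss_function can be applied element-wise
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')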

def loss_function(real, pred):
  # Mask: ignore the model's predictions where the ground truth is padding
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  
  # Calculate the loss
  loss_ = loss_object(real, pred)

  # Make mask compatible with the loss output
  mask = tf.cast(mask, dtype=loss_.dtype)
  
  # Multiply the losses by the mask (i.e. zero out all losses where there's just padding)
  loss_ *= mask
  
  return tf.reduce_mean(loss_)
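
The masking arithmetic can be checked on toy numbers. The NumPy sketch below uses made-up losses and assumes one loss value per example is available (that is, the loss has not already been reduced to a scalar). It also contrasts a mean over all positions, which is what `tf.reduce_mean` computes, with a mean over the non-padding positions only; the second is shown as an alternative, not as what the notebook does.

```python
import numpy as np

# Per-example losses for a batch of four characters at one time step
per_example_loss = np.array([2.0, 4.0, 6.0, 8.0])
# Ground-truth character ids; 0 marks padding
real = np.array([5, 3, 0, 0])

# 1.0 where there is a real character, 0.0 where there is padding
mask = (real != 0).astype(per_example_loss.dtype)
masked = per_example_loss * mask

mean_over_all = masked.mean()               # divides by 4, padding included
mean_over_real = masked.sum() / mask.sum()  # divides by 2, the non-padding count

print(mean_over_all, mean_over_real)
```

Averaging over all positions scales the loss down in batches that are mostly padding, which is worth keeping in mind when comparing losses across buckets of different lengths.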

In [0]:
# Metrics for training
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_acc = tf.keras.metrics.SparseCategoricalAccuracy(name='train_acc')
val_loss = tf.keras.metrics.Mean(name='val_loss')
val_acc = tf.keras.metrics.SparseCategoricalAccuracy(name='val_acc')

def update_accuracy(real, pred, acc_object):
  
  # Find padding
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  
  # If there are no non-padding variables, break out of function
  if tf.math.count_nonzero(mask) == 0:
    return None
  
  # Slice tensors
  real = tf.boolean_mask(real, mask)
  pred = tf.boolean_mask(pred, mask)

  # Compute accuracy
  acc_object.update_state(real, pred)
  
  return None

In [0]:
@tf.function
def train_step(inp, targ, enc_hidden, norm_lim):
  loss = 0
        
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims(inp[:,0], 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)
      _ = update_accuracy(targ[:, t], predictions, train_acc)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  train_loss.update_state(loss)

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)
  
  # Clip gradients
  clipped_gradients = [tf.clip_by_norm(grad, norm_lim) for grad in gradients]

  optimizer.apply_gradients(zip(clipped_gradients, variables))
  
  return train_loss.result(), train_acc.result()
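
`tf.clip_by_norm` rescales each gradient tensor on its own, which can change the relative size of gradients across variables. The pure-Python sketch below illustrates that behaviour on toy values and contrasts it with clipping by the joint norm, which is what `tf.clip_by_global_norm` does. The two helper functions are simplified re-implementations for flat lists, written for illustration only.

```python
import math

def clip_by_norm(values, clip_norm):
    """Rescale a flat list of floats so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= clip_norm:
        return list(values)
    return [v * clip_norm / norm for v in values]

def clip_by_global_norm(tensors, clip_norm):
    """Rescale all tensors by one shared factor based on their joint L2 norm."""
    global_norm = math.sqrt(sum(v * v for t in tensors for v in t))
    scale = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    return [[v * scale for v in t] for t in tensors]

# Two toy "gradients" with L2 norms 5.0 and 0.5
grads = [[3.0, 4.0], [0.3, 0.4]]

per_tensor = [clip_by_norm(g, 3) for g in grads]  # only the first is rescaled
joint = clip_by_global_norm(grads, 3)             # both share one scale factor

print(per_tensor)
print(joint)
```

Per-tensor clipping shrinks the first gradient to norm 3 and leaves the second untouched, so their ratio changes from 10 to 6. Joint clipping keeps the ratio at 10.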

In [0]:
@tf.function
def val_step(inp, targ, enc_hidden):
  
  loss = 0
  
  # Begin feeding data to network
  enc_output, enc_hidden = encoder(inp, enc_hidden)
  dec_hidden = enc_hidden
  dec_input = tf.expand_dims(inp[:,0], 1)
  
  # Cycle through the rest of the time steps
  for t in range(1, targ.shape[1]):
    # Pass enc_output to the decoder
    predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
    
    # Calculate loss and acc
    loss += loss_function(targ[:,t], predictions)
    _ = update_accuracy(targ[:, t], predictions, val_acc)
    
    # Pass the next correct letter to the decoder (teacher forcing)
    dec_input = tf.expand_dims(targ[:,t], 1)
    
  # Calculate val_loss
  val_loss.update_state(loss)
  
  return val_loss.result(), val_acc.result()

In [0]:
def format_time(flt):
  h = flt//3600
  m = (flt % 3600)//60
  s = flt % 60
  out = []
  if h > 0:
    out.append(str(int(h)))
    if h == 1:
      out.append('hr,')
    else:
      out.append('hrs,')
  if m > 0:
    out.append(str(int(m)))
    if m == 1:
      out.append('min, and')
    else:
      out.append('mins, and')
  out.append(f'{s:.2f}')
  out.append('secs')
  return ' '.join(out)

# 4. Run Training Loop

In [0]:
# Set hyperparameters
MAX_LEN = 32
BATCH_SIZE = 128
EPOCHS = 20
x_dir = data_dir + 'train_input.csv'
y_dir = data_dir + 'train_output.csv'
NORM_LIM = 3 # value for clip_norm

# Load data
train_buckets, test_buckets, tkzr, seqs, train_steps, val_steps = load_dataset(x_dir, y_dir, MAX_LEN, BATCH_SIZE, drop = True)

# Save preprocessed training data
with open(os.path.join(data_dir, date.today().isoformat() + "-training-data-and-tkzr.pickle"), 'wb') as f:
  p.dump((tkzr, seqs, train_steps, val_steps), f)

# Get vocab size
num_chars = len(tkzr.word_index) + 1 # Add one for padding
embedding_dim = 25
units = 50

# Define model(s)
encoder = Encoder(num_chars, embedding_dim, units, batch_sz = BATCH_SIZE)
decoder = Decoder(num_chars, embedding_dim, units, batch_sz = BATCH_SIZE)

In [0]:
checkpoint_dir = data_dir + 'checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [0]:
# Loop over epochs
for epoch in range(EPOCHS):
  print(f'Starting Epoch {epoch + 1}\n')
  
  train_loss.reset_states()
  train_acc.reset_states()
  val_loss.reset_states()
  val_acc.reset_states()
  
  start = time.time()
  
  total_batches = 0
  val_batches = 0

  # Loop over buckets
  for bucket, dataset in enumerate(train_buckets):
    # Reset hidden state
    enc_hidden = encoder.initialize_hidden_state()


    for inp, targ in dataset.take(-1):
      loss, acc = train_step(inp, targ, enc_hidden, NORM_LIM)
      
      total_batches += 1
    
      if total_batches % 250 == 0:
          print(f'Epoch {epoch + 1} Bucket {bucket + 1}: Loss {loss:.4f}, Acc {acc:.4f} after {total_batches} batches')
  
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)
    
  # Calculate validation loss and accuracy
  for dataset in test_buckets:
    # Reset hidden state
    enc_hidden = encoder.initialize_hidden_state()
    
    for inp, targ in dataset.take(-1):
      val_loss_, val_acc_ = val_step(inp, targ, enc_hidden)
      
      val_batches += 1

  print(f'\nEpoch {epoch + 1} Loss {loss:.2f}, Avg Acc {acc*100:.2f}%.')
  print(f'Tested on {val_batches * BATCH_SIZE} validation examples:')
  print(f'val_loss = {val_loss_:.2f} val_acc = {val_acc_*100:.2f}%')
  print(f'Time taken for 1 epoch: {format_time(time.time() - start)}\n===========================\n\n')

Starting Epoch 1

Epoch 1 Bucket 15: Loss 100.2790, Acc 0.1914 after 250 batches
Epoch 1 Bucket 21: Loss 90.4950, Acc 0.2380 after 500 batches
Epoch 1 Bucket 25: Loss 85.5238, Acc 0.2640 after 750 batches
Epoch 1 Bucket 28: Loss 82.4062, Acc 0.2824 after 1000 batches
Epoch 1 Bucket 31: Loss 80.3263, Acc 0.2964 after 1250 batches
Epoch 1 Bucket 32: Loss 78.8036, Acc 0.3076 after 1500 batches
Epoch 1 Bucket 34: Loss 77.5342, Acc 0.3171 after 1750 batches
Epoch 1 Bucket 35: Loss 76.6332, Acc 0.3246 after 2000 batches
Epoch 1 Bucket 36: Loss 75.8023, Acc 0.3311 after 2250 batches
Epoch 1 Bucket 37: Loss 75.2700, Acc 0.3359 after 2500 batches

Epoch 1 Loss 75.12, Avg Acc 33.76%.
Tested on 63616 validation examples:
val_loss = 67.67 val_acc = 39.05%
Time taken for 1 epoch: 28 mins, and 21.75 secs


Starting Epoch 2

Epoch 2 Bucket 15: Loss 63.9322, Acc 0.4054 after 250 batches
Epoch 2 Bucket 21: Loss 63.9222, Acc 0.4100 after 500 batches
Epoch 2 Bucket 25: Loss 63.8943, Acc 0.4124 after 750 