<a href="https://colab.research.google.com/github/michaelgfalk/clean-ocr/blob/master/ocr_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Waves of Words: Correcting Trove's Messy OCR

One aim of the *Waves of Words* project is to extract Aboriginal wordlists from [Trove](https://trove.nla.gov.au). A challenge we face is that historical newspapers are difficult to OCR, so many of the texts are riddled with errors.

Using the training data available from the ALTA 2017 OCR competition, can we create a model that will clean the text enough for our Aboriginal word detector to work?

I have been giving some thought to whether uppercase letters and punctuation should be preserved in this model, given that the aim is to clean up the text for our detector, which only requires lowercase letters and ignores punctuation. I think we need to include all the characters in this one. The extra information about sentence boundaries, for example, should help the model, as it would a human, when it tries to correct the text. Moreover, many OCR errors involve exchanging punctuation or digits for letters, e.g. `l = 1 = !`.

**References:**

* D. Mollá, S. Cassidy. Overview of the 2017 ALTA Shared Task: Correcting OCR Errors (2017). *Proc. ALTA 2017*. [https://aclanthology.coli.uni-saarland.de/papers/U17-1014/u17-1014](https://aclanthology.coli.uni-saarland.de/papers/U17-1014/u17-1014)

In [2]:
# Install TensorFlow2
!pip install -q tensorflow-gpu==2.0.0-alpha0


In [0]:
from __future__ import absolute_import, division, print_function

import csv
import os
import time
from itertools import product

from google.colab import drive

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from sklearn.model_selection import train_test_split

In [4]:
# Mount google drive to get training data. Set data_dir
drive.mount('/content/gdrive')
data_dir = '/content/gdrive/My Drive/waves_of_words/ocr_correction_data/'

Mounted at /content/gdrive


In [0]:
# Add start and end, tokenize, pad, chunk
hyper_batches = []

# Start and end tokens
train_joined['original'] = start_char + train_joined['original'] + end_char
train_joined['solution'] = start_char + train_joined['solution'] + end_char

# Tokenise
x_tokens = tkzr.texts_to_sequences(train_joined['original'])
y_tokens = tkzr.texts_to_sequences(train_joined['solution'])

# Iterate over hyper_batches:
for i in range(0, num_articles, batch_size):
  hyper_batch = {}
  
  # Determine slice start and end points
  end = min(i + batch_size, num_articles)
  
  # Get articles for this batch
  batch_x = x_tokens[i:end]
  batch_y = y_tokens[i:end]
  
  # Determine max_len
  max_len = max(len(seq) for seq in batch_x + batch_y)
  # Round up to the next multiple of chunk_size (no change if already a multiple)
  max_len += -max_len % chunk_size
  
  # Pad sequences
  x_padded = pad_sequences(batch_x, maxlen = max_len)
  y_padded = pad_sequences(batch_y, maxlen = max_len)
  
  # Split and stack
  num_chunks = max_len // chunk_size
  hyper_batch['X'] = np.concatenate(np.split(x_padded, num_chunks, axis = 1), axis = 0)
  hyper_batch['Y'] = np.concatenate(np.split(y_padded, num_chunks, axis = 1), axis = 0)
  
  # Create index of (chunk, article) pairs; the final batch may be short
  hyper_batch['index'] = list(product(range(num_chunks), range(end - i)))
  
  # Append
  hyper_batches.append(hyper_batch)
  
  # Sanity check
  print('\nNext hyperbatch:')
  print(f'max_len = {max_len}')
  print(f'chunk_size = {chunk_size}')
  print(f'num_chunks = {num_chunks}')
  print(f'shape of X: {hyper_batch["X"].shape}')
  print(f'shape of Y: {hyper_batch["Y"].shape}')
  

This works, but the last three hyperbatches are going to contain a lot of zero padding...

I have just found out that my 'hyperbatch' solution is NOT NEW. The original seq2seq model used them too, but called them 'buckets'.
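
To make the split-and-stack step concrete, here is a minimal sketch on toy data using NumPy only. The helper `pad_to`, the toy sequences and the `chunk_size` value are illustrative and not part of the project code. It shows that row `c * batch + a` of the stacked array holds chunk `c` of article `a`, which is what the `index` records, and it measures how much of the result is padding.

```python
import numpy as np
from itertools import product

def pad_to(seqs, length):
    """Left-pad integer sequences with zeros to a common length
    (left-padding is also the default of Keras pad_sequences)."""
    out = np.zeros((len(seqs), length), dtype=np.int64)
    for row, seq in enumerate(seqs):
        out[row, length - len(seq):] = seq
    return out

chunk_size = 4                       # hypothetical value
toy_batch = [[1, 2, 3], [4, 5, 6, 7, 8, 9], [1, 1]]

max_len = max(len(s) for s in toy_batch)
max_len += -max_len % chunk_size     # round up to a multiple of chunk_size
padded = pad_to(toy_batch, max_len)  # shape (3, 8)

# Cut each article into chunks along the time axis, then stack the chunks
num_chunks = max_len // chunk_size
stacked = np.concatenate(np.split(padded, num_chunks, axis=1), axis=0)
index = list(product(range(num_chunks), range(len(toy_batch))))

# Row r of `stacked` is chunk index[r][0] of article index[r][1]
padding_fraction = float((stacked == 0).mean())
```

Even in this tiny example more than half of the stacked array is padding, because one long article sets the length for the whole batch.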

# 1. Data Pipeline (new)

This new data pipeline makes use of TensorFlow's `tf.data.Dataset` object to manage the extraction and transformation more efficiently.

In [0]:
def create_dataset(source, target, start = 'स', end = 'ए'):
  """Imports articles from csv and packages for training."""
  
  # Lists for import
  raw_x = []
  raw_y = []
  
  # Import raw text
  with open(source, "rt") as f:
    x_reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for row in x_reader:
      raw_x.append(row[1])
  with open(target, "rt") as f:
    y_reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for row in y_reader:
      raw_y.append(row[1])
  
  # Drop header rows
  raw_x = raw_x[1:]
  raw_y = raw_y[1:]
  
  # Add special characters. Drop a pair if either text already contains a
  # special character, so that x and y stay aligned article by article
  pairs = [(a, b) for a, b in zip(raw_x, raw_y)
           if start not in a + b and end not in a + b]
  x = [start + a + end for a, _ in pairs]
  y = [start + b + end for _, b in pairs]
  
  return x, y

In [0]:
def tokenize(source, target):
  """Instantiates tokenizer, fits and applies."""
  
  # Fit tokenizer
  tkzr = tf.keras.preprocessing.text.Tokenizer(
    num_words = None,
    filters = None,
    lower = False,
    char_level = True
  )
  tkzr.fit_on_texts(source + target)
  
  # Apply to texts
  x = tkzr.texts_to_sequences(source)
  y = tkzr.texts_to_sequences(target)
  
  return x, y, tkzr
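
To illustrate what character-level tokenisation with `lower = False` and no filtering preserves, here is a minimal pure-Python sketch. It is not the Keras implementation: `fit_char_vocab`, `encode` and `decode` are hypothetical helpers, and the toy strings are made up. Like the Keras `Tokenizer`, it numbers characters from 1 in order of frequency and leaves 0 free for padding.

```python
from collections import Counter

def fit_char_vocab(texts):
    """Map each distinct character to an integer id, most frequent first.
    Id 0 is left free so it can be used for padding."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}

def encode(text, vocab):
    """Convert a string to a list of character ids."""
    return [vocab[ch] for ch in text]

def decode(ids, vocab):
    """Invert encode, skipping the padding id 0."""
    inverse = {i: ch for ch, i in vocab.items()}
    return ''.join(inverse[i] for i in ids if i != 0)

texts = ['Ten Years Aeo!', 'Ten Years Ago.']
vocab = fit_char_vocab(texts)
ids = encode(texts[0], vocab)
```

Upper and lower case letters, spaces and punctuation all receive their own ids, so the round trip through `encode` and `decode` returns the original string unchanged.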

In [0]:
def load_dataset(x_path, y_path):
  """Reads the paired csv files and returns tokenized x, y and the fitted tokenizer."""
  
  x, y = create_dataset(x_path, y_path)
  
  x, y, tkzr = tokenize(x, y)
  
  return x, y, tkzr

In [0]:
x, y, tkzr = load_dataset(data_dir + 'train_input.csv', data_dir + 'train_output.csv')

In [0]:
def convert(tkzr, tensor, n = 100):
  """Prints the first n tokens of a sequence as text, skipping padding zeros."""
  text = [tkzr.index_word[x] for x in tensor[:n] if x != 0]
  text = ''.join(text + ['...'])
  print(text)

In [59]:
print("The raw x...")
convert(tkzr, x[7])
print("The corresponding y...")
convert(tkzr, y[7])

The raw x...
सAnd Ten Years Aeo TSLINGTON rairway workshops are idle. This is occasioned by resentment of the men...
The corresponding y...
सAnd Ten Years Ago ISLINGTON railway workshops are idle. This is occasioned by resentment of the men...


In [115]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .2)

max_train = max([len(x) for x in x_train])
max_test = max([len(x) for x in x_test])

print(f'The longest training sequence has {max_train} characters.')
print(f'The longest test sequence has {max_test}.')

The longest training sequence has 48426 characters.
The longest test sequence has 46124.
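
Sequences this long cannot sensibly be fed to a recurrent model in one piece, which is why the articles get cut into chunks. A quick calculation, assuming the chunk length of 200 that is used for `MAX_LEN` later in this notebook, shows how many chunks the longest training article needs:

```python
import math

longest_train = 48426   # from the output above
chunk_len = 200         # assumed: the MAX_LEN value set later in the notebook

num_chunks = math.ceil(longest_train / chunk_len)
padding_needed = num_chunks * chunk_len - longest_train
```

Here `num_chunks` comes to 243 and `padding_needed` to 174 characters for that one article.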


In [0]:
# Token ids are integers. The lambda builds a fresh generator on each pass over the data.
data_gen = lambda: ((np.array(x, 'int32'), np.array(y, 'int32')) for x, y in zip(x_train, y_train))
dataset = tf.data.Dataset.from_generator(data_gen, (tf.int32, tf.int32))

In [0]:
# Define function for bucketing and padding data
BATCH_SIZE = 64
MAX_LEN = 200

max_lens = [max(len(x),len(y)) for x,y in zip(x_train, y_train)]
max_lens = sorted(max_lens)

# Define upper length limit of buckets
bucket_boundaries = max_lens[0::BATCH_SIZE]
# round up to nearest multiple of max_len
bucket_boundaries = [MAX_LEN - (b % MAX_LEN) + b for b in bucket_boundaries]

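# One batch size per bucket, plus one for the overflow bucket
batch_members = [BATCH_SIZE] * (len(bucket_boundaries) + 1)

# Define function. Each element is an (x, y) pair, so the length function
# takes two arguments; use the longer of the two, as in max_lens above
bucket_func = tf.data.experimental.bucket_by_sequence_length(
    lambda x, y: tf.maximum(tf.shape(x)[0], tf.shape(y)[0]),
    bucket_boundaries,
    batch_members)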

ValueError: ignored
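
Independently of the error above, here is a library-free sketch of what bucketing by sequence length is meant to achieve. It is not the `tf.data` implementation: `bucket_by_length` is a hypothetical helper and the toy lengths are made up. I am assuming that upper boundaries are exclusive, as they are in `bucket_by_sequence_length`.

```python
from bisect import bisect_right

def bucket_by_length(pairs, boundaries, batch_size):
    """Group (x, y) pairs into batches of similar length.

    A pair whose longer member has length n goes to the first bucket whose
    boundary is greater than n; pairs beyond the last boundary share a final
    overflow bucket. Full batches are emitted as soon as they fill up.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for x, y in pairs:
        bucket_id = bisect_right(boundaries, max(len(x), len(y)))
        buckets[bucket_id].append((x, y))
        if len(buckets[bucket_id]) == batch_size:
            yield bucket_id, buckets[bucket_id]
            buckets[bucket_id] = []
    # Emit whatever is left over as smaller final batches
    for bucket_id, rest in enumerate(buckets):
        if rest:
            yield bucket_id, rest

# Toy data: sequences of lengths 3, 5, 12, 4, 11 and 30 (x and y equal here)
toy_pairs = [([1] * n, [1] * n) for n in (3, 5, 12, 4, 11, 30)]
batches = list(bucket_by_length(toy_pairs, boundaries=[10, 20], batch_size=2))
batch_lengths = [(b, [len(x) for x, _ in batch]) for b, batch in batches]
```

Each batch only contains sequences from one length range, so padding to the longest member of a batch wastes far less space than padding everything to the longest article.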

# 2. Model Definition

Adapted from TensorFlow docs.

In [0]:
class Encoder(tf.keras.Model):
  def __init__(self, num_chars, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(num_chars, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units, 
                                   # The following parameters must be set this way
                                   # to use CuDNN on GPU
                                   activation='tanh',
                                   recurrent_activation='sigmoid',
                                   recurrent_dropout=0,
                                   unroll=False,
                                   use_bias=True,
                                   reset_after=True,
                                   # The following parameters are necessary for the
                                   # encoder-decoder architecture
                                   return_sequences=True, 
                                   return_state=True,
                                   # Stateful must be 'True' in order
                                   # to link the batches in each hyperbatch
                                   stateful=True,
                                   # Just the standard initializer
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)        
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [0]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
  
  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because self.V is a Dense layer with one unit
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    # softmax over axis 1 normalizes the weights across the time steps
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights
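
The shapes in the attention layer are easier to check outside TensorFlow. The sketch below reproduces the same additive (Bahdanau-style) computation in NumPy with random weights. It is an illustration under simplifying assumptions: the `Dense` biases are omitted, and `additive_attention` and all sizes are hypothetical.

```python
import numpy as np

def additive_attention(query, values, W1, W2, v):
    """Additive attention for one batch.

    query:  (batch, hidden)            decoder state
    values: (batch, max_len, hidden)   encoder outputs
    W1, W2: (hidden, units) projection matrices; v: (units, 1)
    """
    # Broadcast the query over the time axis: (batch, 1, units)
    projected_query = (query @ W2)[:, None, :]
    # score: (batch, max_len, 1)
    score = np.tanh(values @ W1 + projected_query) @ v
    # Softmax over the time axis so the weights for each example sum to 1
    score = score - score.max(axis=1, keepdims=True)
    weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    # Weighted sum of encoder outputs: (batch, hidden)
    context = (weights * values).sum(axis=1)
    return context, weights

rng = np.random.default_rng(0)
batch, max_len, hidden, units = 2, 5, 4, 3   # illustrative sizes
query = rng.normal(size=(batch, hidden))
values = rng.normal(size=(batch, max_len, hidden))
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=(units, 1))

context, weights = additive_attention(query, values, W1, W2, v)
```

The context vector has one row per example and the weights have one value per encoder time step, matching the shape comments in the class above.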

In [0]:
class Decoder(tf.keras.Model):
  def __init__(self, num_chars, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(num_chars, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units, 
                                   # The following parameters must be set this way
                                   # to use CuDNN on GPU
                                   activation='tanh',
                                   recurrent_activation='sigmoid',
                                   recurrent_dropout=0,
                                   unroll=False,
                                   use_bias=True,
                                   reset_after=True,
                                   # The following parameters are necessary for the
                                   # encoder-decoder architecture
                                   return_sequences=True, 
                                   return_state=True,
                                   # Stateful must be 'True' in order
                                   # to link the batches in each hyperbatch
                                   stateful=True,
                                   # Just the standard initializer
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(num_chars)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

# 3. Set up Training

In [0]:
optimizer = tf.keras.optimizers.Adam()
# reduction='none' keeps per-example losses so that padding can be masked out
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask
  
  return tf.reduce_mean(loss_)
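
One detail worth checking in the masked loss is the normalisation. `tf.reduce_mean` divides by the total number of positions, including the padded ones, so batches with more padding report a smaller loss. The NumPy sketch below compares that with dividing by the number of real tokens. The per-token losses are made-up values, and the second normalisation is an alternative I am considering, not what the cell above does.

```python
import numpy as np

# Hypothetical per-token losses for one decoding step, batch of 4
per_token_loss = np.array([2.0, 1.0, 3.0, 2.0])
# Target ids; 0 marks padding
real = np.array([5, 0, 7, 0])

# Zero out the loss wherever the target is padding
mask = (real != 0).astype(per_token_loss.dtype)
masked = per_token_loss * mask

mean_over_all = masked.mean()                # divides by all 4 positions
mean_over_real = masked.sum() / mask.sum()   # divides by the 2 real tokens
```

With half of the batch padded, the first figure is half the second, which matters when comparing losses across hyperbatches with different amounts of padding.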

In [0]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [0]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0
        
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    # Start every sequence with the start character added in create_dataset
    dec_input = tf.expand_dims([tkzr.word_index['स']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))
  
  return batch_loss
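
The teacher-forcing loop can be easier to follow on a toy sequence. In this sketch (plain Python, made-up token ids, no model), each step pairs the decoder input with the target it is scored against. The input at step `t` is always the true token at `t - 1`, regardless of what the model predicted.

```python
# Made-up token ids for one target sequence: start, three characters, end
targ = [1, 14, 9, 22, 2]

steps = []
dec_input = targ[0]              # decoding begins from the start token
for t in range(1, len(targ)):
    expected = targ[t]           # token the prediction is scored against
    steps.append((dec_input, expected))
    dec_input = targ[t]          # teacher forcing: feed the true token next

# `steps` now pairs each decoder input with its target
```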

# 4. Run Training Loop

In [0]:
# Set hyperparameters
MAX_LEN = 200
BATCH_SIZE = 64
EPOCHS = 10

num_batches = 

In [0]:
# Loop over epochs
for epoch in range(EPOCHS):
  start = time.time()
  
  total_loss = 0

  # Loop over hyperbatches
  for hyperbatch in # NEED TO CREATE DATASET OBJECT:
    # Reset hidden state
    enc_hidden = encoder.initialize_hidden_state()


    for (batch, (inp, targ)) in # NEED TO CREATE DATASET OBJECT:
      batch_loss = train_step(inp, targ, enc_hidden)
      total_loss += batch_loss

      if batch % 100 == 0:
          print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                       batch,
                                                       batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / num_batches))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))