## Basic Usage

This notebook demonstrates how to use the seq2seq model to process noisy text.

The example used in this notebook is a truncated article from the 11/1/1848 issue of the London Times. Let's first print it out to see what we're working with.

In [1]:
with open("sample-files/floods.txt", 'r') as f:
    sample_text = f.read()

sample_text

"T Er LooDs AND THE RAILwAYS--The flbods that dwing the lut fortnight have been so unprecedented,particularly in the eastern and north-eastem districts, have' put toaseverer test than usual the strength and solidity of the ralways The ordinary, roads Iave been rendered im- asmbleOn the branch lfnes of the Eastern Counties, from yto Peterboroughandbeyond, tho e arable andmeadoweands are two and thro3 feet- un4erfor six and seven miles around, o that viewed from a train in translt, the cointry presents nothing but a vast aseet of vvater, from ilhich- neither the canals nor the rivers can be distingished "

This text is very noisy, and it is difficult to piece together exactly what the article is saying. We can begin by importing the model from the seq2seqocr.py file and cleaning up our text with the 'preprocess' method.

In [2]:
# These two lines help Python locate seq2seqocr.py from this notebook
import sys
sys.path.append('../')

from seq2seqocr import Seq2SeqOCR

# Initialize the model. By default, it assumes the model is saved at MODEL_PATH in settings.py
model = Seq2SeqOCR()



2021-09-05 17:48:48.784915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
preprocessed = model.preprocess(sample_text)

preprocessed

'[CLS]t [CLS]er [CLS]loods and the [CLS]railwaysthe [CLS]flbods that [CLS]dwing the [CLS]lut fortnight have been so unprecedented particularly in the eastern and [CLS]northeastem districts have put [CLS]toaseverer test than usual the strength and solidity of the [CLS]ralways the ordinary roads [CLS]iave been rend ered [CLS]imasmbleon the branch [CLS]lfnes of the eastern counties from [CLS]yto peter borough and beyond thoe arable [CLS]andmeadoweands are two and thro [CLS]feetunerfor six and seven miles around [CLS]o that viewed from a train in [CLS]translt the [CLS]cointry presents nothing but a vast [CLS]aseet of [CLS]vvater from [CLS]ilhichneither the [CLS]canals nor the rivers can be [CLS]distingished'

For each word in our text, 'preprocess' runs several rounds of checks to clean up spacing issues. If it still cannot recognize the word after these checks, it tags the word with a classification token ('[CLS]').
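
To see at a glance which words were tagged, the tags can be pulled out of the preprocessed string with a regular expression. The following is only a minimal sketch: `tagged_words` and `CLS_PATTERN` are hypothetical names that are not part of seq2seqocr.py, and the pattern assumes a tag is always written as `[CLS]` immediately followed by the word, as in the output above.

```python
import re

# Matches a "[CLS]" tag followed immediately by the word it marks.
# Assumption: tags always appear in this exact form.
CLS_PATTERN = re.compile(r"\[CLS\](\S+)")


def tagged_words(preprocessed_text):
    """Return the words that carry a [CLS] tag, in order of appearance."""
    return CLS_PATTERN.findall(preprocessed_text)


# A short excerpt of the preprocessed output shown above.
sample = "[CLS]t [CLS]er [CLS]loods and the [CLS]railwaysthe [CLS]flbods that"
print(tagged_words(sample))
# ['t', 'er', 'loods', 'railwaysthe', 'flbods']
```

Calling `tagged_words(preprocessed)` on the full string should list every word that the model will be asked to correct.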

The intent of this preprocessing stage is to clean up the text for our model, as well as correct any obvious issues. The classification tokens designate which words we should run the model on. We can do that with the 'process_text' method.

In [4]:
infer_text = model.process_text(preprocessed)

infer_text

2021-09-05 17:48:51.982006: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)


t teal False
er eare False
loods boods False
railwaysthe railwawarvie False
flbods sthoof False
dwing dwing False
lut lant False
northeastem northeaster False
toaseverer twesewerer False
ralways ralwwfs False
iave have False
imasmbleon immashifine False
lfnes isness False
yto yeto False
andmeadoweands admendacednew False
feetunerfor secturensor False
o onloc False
translt transit False
cointry country False
aseet affect False
vvater whoter False
ilhichneither ilibernechiel False
canals canass False
distingished distinghished False


't er loods and the railwaysthe flbods that dwing the lut fortnight have been so unprecedented particularly in the eastern and northeastem districts have put toaseverer test than usual the strength and solidity of the ralways the ordinary roads iave been rend ered imasmbleon the branch lfnes of the eastern counties from yto peter borough and beyond thoe arable andmeadoweands are two and thro feetunerfor six and seven miles around o that viewed from a train in translt the cointry presents nothing but a vast aseet of vvater from ilhichneither the canals nor the rivers can be distingished'

In [5]:
# NOTE: we do not need to preprocess the text before calling model.process_text().
# The process_text() method itself has a call to preprocess() inside it.

infer_text_no_preprocess = model.process_text(sample_text)

assert infer_text_no_preprocess == infer_text

t teal False
er eare False
loods boods False
railwaysthe railwawarvie False
flbods sthoof False
dwing dwing False
lut lant False
northeastem northeaster False
toaseverer twesewerer False
ralways ralwwfs False
iave have False
imasmbleon immashifine False
lfnes isness False
yto yeto False
andmeadoweands admendacednew False
feetunerfor secturensor False
o onloc False
translt transit False
cointry country False
aseet affect False
vvater whoter False
ilhichneither ilibernechiel False
canals canass False
distingished distinghished False


By default, process_text() is invoked with the safe_mode parameter set to True. In safe mode, the program does not replace the original word if the model's output is not itself a recognizable English word. We can see the model's output for all words by setting the safe_mode parameter to False.

In [6]:
infer_text_no_safe_mode = model.process_text(sample_text, safe_mode=False)

infer_text_no_safe_mode

t teal True
er eare False
loods boods False
railwaysthe railwawarvie False
flbods sthoof False
dwing dwing False
lut lant False
northeastem northeaster False
toaseverer twesewerer False
ralways ralwwfs False
iave have True
imasmbleon immashifine False
lfnes isness False
yto yeto False
andmeadoweands admendacednew False
feetunerfor secturensor False
o onloc False
translt transit True
cointry country True
aseet affect True
vvater whoter False
ilhichneither ilibernechiel False
canals canass False
distingished distinghished False


'teal er loods and the railwaysthe flbods that dwing the lut fortnight have been so unprecedented particularly in the eastern and northeastem districts have put toaseverer test than usual the strength and solidity of the ralways the ordinary roads have been rend ered imasmbleon the branch lfnes of the eastern counties from yto peter borough and beyond thoe arable andmeadoweands are two and thro feetunerfor six and seven miles around o that viewed from a train in transit the country presents nothing but a vast affect of vvater from ilhichneither the canals nor the rivers can be distingished'

Let's compare the two outputs to see which corrections were applied once safe mode was turned off.

In [7]:
# Both strings contain the same number of words, so zip() pairs them position by position
for orig_word, new_word in zip(infer_text.split(), infer_text_no_safe_mode.split()):
    if orig_word != new_word:
        print(f"{orig_word} ---> {new_word}")

t ---> teal
iave ---> have
translt ---> transit
cointry ---> country
aseet ---> affect

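
The word-by-word comparison above only lines up correctly when both strings split into the same number of tokens. As a hedged sketch, the same idea can be wrapped in a small helper that checks this first. `changed_words` is a hypothetical name introduced here for illustration and is not part of seq2seqocr.py.

```python
def changed_words(before, after):
    """Return (old, new) pairs for each position where the two texts differ.

    Assumes both texts split into the same number of whitespace-separated
    tokens. Raises ValueError otherwise, so misaligned pairs are never reported.
    """
    before_tokens = before.split()
    after_tokens = after.split()
    if len(before_tokens) != len(after_tokens):
        raise ValueError("texts have different token counts")
    return [(old, new) for old, new in zip(before_tokens, after_tokens) if old != new]


print(changed_words("the cointry presents nothing", "the country presents nothing"))
# [('cointry', 'country')]
```

Raising an error on a length mismatch is a design choice for this sketch. A sequence-alignment tool such as the standard library's `difflib` would be a better fit if a correction could ever split or merge words.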

In [8]:
'transit' in model.english_lexicon

True
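
The membership test above can be repeated for a whole batch of suggestions. The sketch below is illustrative only: `vet_suggestions` is a hypothetical helper, and a small hand-written set stands in for `model.english_lexicon`, whose actual type and contents are not shown in this notebook. It is not a description of how process_text() decides which suggestions to keep.

```python
def vet_suggestions(pairs, lexicon):
    """Split (original, suggestion) pairs by whether the suggestion is in the lexicon."""
    accepted, rejected = [], []
    for original, suggestion in pairs:
        if suggestion in lexicon:
            accepted.append((original, suggestion))
        else:
            rejected.append((original, suggestion))
    return accepted, rejected


# Assumption: a toy stand-in for model.english_lexicon.
toy_lexicon = {"have", "transit", "country"}

# A few (original, suggestion) pairs taken from the output above.
pairs = [("iave", "have"), ("translt", "transit"), ("vvater", "whoter")]

accepted, rejected = vet_suggestions(pairs, toy_lexicon)
print(accepted)  # [('iave', 'have'), ('translt', 'transit')]
print(rejected)  # [('vvater', 'whoter')]
```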