## Basic Usage

This notebook demonstrates how to use the seq2seq model to process noisy text.

The example used in this notebook is a truncated article from the 11/1/1848 issue of the London Times. Let's first print it out to see what we're working with.

In [1]:
with open("sample-files/floods.txt", 'r') as f:
    sample_text = f.read()

sample_text

"T Er LooDs AND THE RAILwAYS--The flbods that dwing the lut fortnight have been so unprecedented,particularly in the eastern and north-eastem districts, have' put toaseverer test than usual the strength and solidity of the ralways The ordinary, roads Iave been rendered im- asmbleOn the branch lfnes of the Eastern Counties, from yto Peterboroughandbeyond, tho e arable andmeadoweands are two and thro3 feet- un4erfor six and seven miles around, o that viewed from a train in translt, the cointry presents nothing but a vast aseet of vvater, from ilhich- neither the canals nor the rivers can be distingished "

This text is very noisy, and it is difficult to piece together exactly what the article is saying. We can begin by importing the model from the seq2seqocr.py file and cleaning up our text with the 'preprocess' method.

In [2]:
# these two lines help with locating the file from this notebook
import sys
sys.path.append('../')

from seq2seqocr import Seq2SeqOCR

# initialize the model. By default, it assumes the model is saved at MODEL_PATH in settings.py
model = Seq2SeqOCR()



2021-09-05 20:22:49.195792: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
preprocessed = model.preprocess(sample_text)

preprocessed

't er [CLS]loods and the [CLS]railwaysthe [CLS]flbods that [CLS]dwing the lut fortnight have been so unprecedented particularly in the eastern and [CLS]northeastem districts have put [CLS]toaseverer test than usual the strength and solidity of the [CLS]ralways the ordinary roads [CLS]iave been rend ered [CLS]imasmbleon the branch [CLS]lfnes of the eastern counties from yto peter borough and beyond thoe arable [CLS]andmeadoweands are two and thro [CLS]feetunerfor six and seven miles around o that viewed from a train in [CLS]translt the [CLS]cointry presents nothing but a vast [CLS]aseet of [CLS]vvater from [CLS]ilhichneither the [CLS]canals nor the rivers can be [CLS]distingished'

For each word in our text, 'preprocess' goes through several rounds of checks to clean up spacing issues. If it cannot recognize the word at the end of these checks, it tags the word with a classification token ('CLS'). 
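To make the tagging idea concrete, here is a minimal sketch of the final step only: marking words that are not found in a vocabulary. It is not the actual `preprocess` implementation, which also repairs spacing and runs further checks. The names `CLS_TOKEN`, `KNOWN_WORDS`, and `tag_unknown_words` are hypothetical, and the vocabulary is a toy stand-in.

```python
# Hypothetical sketch: NOT the real Seq2SeqOCR.preprocess implementation.
CLS_TOKEN = "[CLS]"  # assumed to match the tag visible in the output above
KNOWN_WORDS = {"the", "and", "floods", "railways", "have", "been"}  # toy vocabulary


def tag_unknown_words(text, vocabulary=KNOWN_WORDS, token=CLS_TOKEN):
    """Lowercase the text and prefix every out-of-vocabulary word with the token."""
    tagged = []
    for word in text.lower().split():
        tagged.append(word if word in vocabulary else token + word)
    return " ".join(tagged)


print(tag_unknown_words("the flbods and the ralways"))
# the [CLS]flbods and the [CLS]ralways
```

Only the tagged words would then need to be passed to the model, which keeps inference focused on the words most likely to be OCR errors.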

The intent of this preprocessing stage is to clean up the text for our model, as well as correct any obvious issues. The classification tokens designate which words we should run the model on. We can do that with the 'process_text' method.

In [4]:
infer_text = model.process_text(preprocessed)

infer_text

2021-09-05 20:22:52.574755: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)


't er loods and the railwaysthe flbods that dwing the lut fortnight have been so unprecedented particularly in the eastern and northeastem districts have put toaseverer test than usual the strength and solidity of the ralways the ordinary roads have been rend ered imasmbleon the branch lfnes of the eastern counties from yto peter borough and beyond thoe arable andmeadoweands are two and thro feetunerfor six and seven miles around o that viewed from a train in transit the country presents nothing but a vast affect of vvater from ilhichneither the canals nor the rivers can be distingished'

In [5]:
# NOTE: we do not need to preprocess the text before calling model.process_text().
# The process_text() method itself has a call to preprocess() inside it.

infer_text_no_preprocess = model.process_text(sample_text)

assert (infer_text_no_preprocess == infer_text)

By default, process_text() is invoked with the safe_mode parameter set to True. In safe mode, the program keeps the original word whenever the model's output is not itself a recognizable English word. We can see the model's output for all words by setting the safe_mode parameter to False.

In [6]:
infer_text_no_safe_mode = model.process_text(sample_text, safe_mode=False)

infer_text_no_safe_mode

't er boods and the railwawarvie sthoof that dwing the lut fortnight have been so unprecedented particularly in the eastern and northeaster districts have put twesewerer test than usual the strength and solidity of the ralwwfs the ordinary roads have been rend ered immashifine the branch isness of the eastern counties from yto peter borough and beyond thoe arable admendacednew are two and thro secturensor six and seven miles around o that viewed from a train in transit the country presents nothing but a vast affect of whoter from ilibernechiel the canass nor the rivers can be distinghished'
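The safe-mode rule described earlier can be pictured with a small sketch. This is only an illustration of the decision, assuming a simple set lookup stands in for "recognizable English word"; the real logic lives inside `Seq2SeqOCR`, and the names `ENGLISH_WORDS` and `choose_word` are hypothetical.

```python
# Hypothetical sketch of the safe-mode decision; not the library's own code.
ENGLISH_WORDS = {"transit", "country", "have", "affect"}  # toy stand-in for a dictionary


def choose_word(original, model_output, safe_mode=True, vocabulary=ENGLISH_WORDS):
    """Return the model's output unless safe mode rejects it as unrecognized."""
    if safe_mode and model_output not in vocabulary:
        return original  # keep the noisy original rather than an unknown word
    return model_output


print(choose_word("translt", "transit"))               # transit
print(choose_word("loods", "boods"))                   # loods
print(choose_word("loods", "boods", safe_mode=False))  # boods
```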

Let's compare the two texts to see what corrections were rejected in safe mode.

In [7]:
# infer_text holds the safe-mode output, so any differing word is a suggestion that safe mode rejected
for orig_word, rej_suggestion in zip(infer_text.split(), infer_text_no_safe_mode.split()):
    if orig_word != rej_suggestion:
        print(f"{orig_word} ---> {rej_suggestion}")

loods ---> boods
railwaysthe ---> railwawarvie
flbods ---> sthoof
northeastem ---> northeaster
toaseverer ---> twesewerer
ralways ---> ralwwfs
imasmbleon ---> immashifine
lfnes ---> isness
andmeadoweands ---> admendacednew
feetunerfor ---> secturensor
vvater ---> whoter
ilhichneither ---> ilibernechiel
canals ---> canass
distingished ---> distinghished


Several of these inferences are wide of the mark, so it's good to see that they were rejected. Let's see which suggestions were accepted in safe mode.

In [8]:
from seq2seqocr import CLASSIFICATION_TKN

for pped_word, replacement in zip(preprocessed.split(), infer_text.split()):
    if pped_word.startswith(CLASSIFICATION_TKN):
        # slice off the tag; str.lstrip() strips a set of characters, not a prefix
        word_without_tag = pped_word[len(CLASSIFICATION_TKN):]
        if word_without_tag != replacement:
            print(f"{word_without_tag} ---> {replacement}")

iave ---> have
translt ---> transit
cointry ---> country
aseet ---> affect


Some of these suggestions are still ambiguous, but the changes for 'translt'->'transit' and 'cointry'->'country' are clearly useful corrections that might otherwise be difficult to spot. Even the 'iave'->'have' correction seems valid, and on further review 'have' is in fact the correct word:

**Original:** "The ordinary, roads ***Iave*** been rendered im- asmbleOn"  
**Real:** "The ordinary roads ***have*** been rendered impassable."
<br/>

We can compare the output of our model to that of PyEnchant's Spellcheck tool.

In [9]:
import enchant

d_uk = enchant.Dict("en_UK")

for bad_word in ['iave', 'translt', 'cointry']:
    print(f"{bad_word} ---> {d_uk.suggest(bad_word)[:5]}")  # top 5 suggestions

iave ---> ['Ave', 'ave', "I've", 'eave', 'Iva']
translt ---> ['transl', 'translate', 'transit', 'translator', 'transact']
cointry ---> ['country', 'contra', 'Cointreau', 'coin try', 'coin-try']


The 'iave' suggestions are nowhere close to correct, as the spellchecker has no way to realize that the lowercase 'h' has been misread in this context.  

The 'l' in 'translt' throws the spellchecker off, leading it to suggest words stemming from 'translate' as the most-likely corrections. It does suggest 'transit' in there, but it is only the 3rd most-likely correction.  

Finally, the 'cointry'->'country' correction is correctly performed by the spellchecking tool.
<br/>
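The comparison above can be summarized by checking where the intended word ranks among the spellchecker's suggestions. The sketch below copies the suggestion lists from the PyEnchant output, so it runs without PyEnchant installed; `rank_of` is a hypothetical helper, and the intended words are the readings assumed in this notebook.

```python
# Suggestion lists copied from the PyEnchant output above.
suggestions = {
    "iave": ["Ave", "ave", "I've", "eave", "Iva"],
    "translt": ["transl", "translate", "transit", "translator", "transact"],
    "cointry": ["country", "contra", "Cointreau", "coin try", "coin-try"],
}
# Intended readings, as assumed in this notebook.
expected = {"iave": "have", "translt": "transit", "cointry": "country"}


def rank_of(word, candidates):
    """Return the 1-based position of word in candidates, or None if absent."""
    return candidates.index(word) + 1 if word in candidates else None


ranks = {bad: rank_of(expected[bad], suggestions[bad]) for bad in suggestions}
print(ranks)
# {'iave': None, 'translt': 3, 'cointry': 1}
```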

On this sample, then, our model handles context-dependent errors better than the spellchecker: it recovers 'have' and 'transit', where the spellchecker's top suggestion is wrong. If you want, you can combine the predictions of the seq2seq model with other tools such as PyEnchant's spellchecker. If you review the rejected corrections, you'll notice that a few of them could probably be fixed by a spellchecking tool (e.g., 'distingished'->'distinguished'). An example combining the seq2seq model and PyEnchant is demonstrated below.


In [10]:
# example combination of seq2seq and PyEnchant spellchecker

corrected_text = []

# this is the same code as before, except now we run the spellchecker over unchanged words
for pped_word, replacement in zip(preprocessed.split(), infer_text.split()):
    if pped_word.startswith(CLASSIFICATION_TKN):
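        # slice off the tag; str.lstrip() strips a set of characters, not a prefix
        word_without_tag = pped_word[len(CLASSIFICATION_TKN):]
        if (word_without_tag == replacement):
            suggestions = d_uk.suggest(word_without_tag)
            # take the first suggestion; keep the word if the spellchecker has none
            spellcheck_word = suggestions[0] if suggestions else word_without_tag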
            print(f"{word_without_tag} ---> {spellcheck_word}")
            corrected_text.append(spellcheck_word)
        else:
            corrected_text.append(replacement)
    else:
        corrected_text.append(pped_word)

" ".join(corrected_text)


loods ---> loos
railwaysthe ---> railways the
flbods ---> floods
dwing ---> swing
northeastem ---> northeaster
toaseverer ---> severer
ralways ---> railways
imasmbleon ---> assembling
lfnes ---> lanes
andmeadoweands ---> understands
feetunerfor ---> fortune
vvater ---> vaster
ilhichneither ---> unhitching
canals ---> canals
distingished ---> distinguished


't er loos and the railways the floods that swing the lut fortnight have been so unprecedented particularly in the eastern and northeaster districts have put severer test than usual the strength and solidity of the railways the ordinary roads have been rend ered assembling the branch lanes of the eastern counties from yto peter borough and beyond thoe arable understands are two and thro fortune six and seven miles around o that viewed from a train in transit the country presents nothing but a vast affect of vaster from unhitching the canals nor the rivers can be distinguished'

This approach ends up correcting several mistakes that slipped past our model. Notably, it separates 'railwaysthe' and correctly fixes misspellings like 'flbods' and 'ralways'. However, note that the spellchecker sometimes fails on longer and noisier words ('andmeadoweands'->'understands'), and it is difficult to determine exactly when a correction has failed. One suggestion is to implement a check based on the Levenshtein distance between the original word and the suggested replacement.
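One way the distance check suggested above might look is sketched below. The `levenshtein` function is the standard dynamic-programming edit distance; `accept_suggestion` and its `max_ratio` threshold of 0.3 are hypothetical, and the threshold is an arbitrary starting point that would need tuning on real data.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def accept_suggestion(word, suggestion, max_ratio=0.3):
    """Accept a suggestion only if it changes a small fraction of the word.

    max_ratio is an assumed, untuned threshold.
    """
    if not word:
        return False
    return levenshtein(word, suggestion) / len(word) <= max_ratio


print(levenshtein("flbods", "floods"))                     # 1
print(accept_suggestion("flbods", "floods"))               # True
print(accept_suggestion("andmeadoweands", "understands"))  # False
```

A ratio relative to word length is used rather than a fixed distance so that long, noisy words are not held to the same absolute limit as short ones. Note that such a check cannot catch plausible-looking but wrong replacements that are only one edit away, such as 'loods'->'loos'.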

I ultimately leave it up to the user to decide whether or not to include other spellchecking programs along with the seq2seq model in their correction. The seq2seq model is a useful tool that can identify common mistakes and hopefully offer correct suggestions.  

I hope that these notebook examples have been useful in demonstrating how to use the seq2seq model by itself and in combination with other tools. To see more on how to customize the model to fit your needs, see [custom_usage](custom_usage.ipynb).