# Custom Usage
This notebook demonstrates how to customize the model for your needs.

## English Lexicon/Hashsets
First up, the model 'recognizes' English words via two saved lexicon files. The English lexicon is compiled from millions of documents in the COHA corpus. The common lexicon is compiled from Google's 10000 most-used words list.

The English lexicon is not comprehensive -- there are still many valid words that aren't contained within it, but it provides pretty broad coverage for this historical context.

In [1]:
# these two lines help with locating the file from this notebook
import sys
sys.path.append('../')

from seq2seqocr import Seq2SeqOCR

model = Seq2SeqOCR()  # init model

print("Size of English Lexicon: ", len(model.english_lexicon))
print("Size of Common Lexicon: ", len(model.common_lexicon))





Size of English Lexicon:  47886
Size of Common Lexicon:  8591


If you want to change the contents of these lexicons, you can point the paths in settings.py at different files, or you can add words manually. Both lexicons are implemented as hash sets (Python sets).

In [2]:
SPONGEBOB_REFERENCE = 'goofygooberrock'

print("Test word in English lexicon before adding: ", SPONGEBOB_REFERENCE in model.english_lexicon)

model.english_lexicon.add(SPONGEBOB_REFERENCE)  # adds word to set

print("Test word in English lexicon after adding: ", SPONGEBOB_REFERENCE in model.english_lexicon)


Test word in English lexicon before adding:  False
Test word in English lexicon after adding:  True


Note that added words do not carry over to other instances of the Seq2SeqOCR class.

In [3]:
other_instance = Seq2SeqOCR()

print("Test word in separate instance's lexicon: ", SPONGEBOB_REFERENCE in other_instance.english_lexicon)

Test word in separate instance's lexicon:  False


You can also remove words from the lexicon in the same way. Anything you can do with a Python set also works on these two lexicons.

In [4]:
print("Differences between both set instances: ", model.english_lexicon.difference(other_instance.english_lexicon))

model.english_lexicon.remove(SPONGEBOB_REFERENCE)

print("Test word in English lexicon after removal: ", SPONGEBOB_REFERENCE in model.english_lexicon)

Differences between both set instances:  {'goofygooberrock'}
Test word in English lexicon after removal:  False
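
Since the lexicons behave like ordinary Python sets, the other standard set methods should apply as well. The sketch below uses a small stand-in set with made-up words rather than a real Seq2SeqOCR instance. It shows bulk additions with update() and the difference between remove() and discard().

```python
# Stand-in for model.english_lexicon; the real lexicon is loaded by Seq2SeqOCR.
lexicon = {"harbor", "steam", "engine"}

# Add several words in one call.
lexicon.update(["telegraph", "locomotive"])

# discard() silently ignores words that are not in the set ...
lexicon.discard("notaword")

# ... whereas remove() raises KeyError for them.
try:
    lexicon.remove("notaword")
    removed_missing = True
except KeyError:
    removed_missing = False

print(sorted(lexicon))
print("remove() succeeded on a missing word:", removed_missing)
```

If you are not sure whether a word is present, discard() is the safer choice.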


If you have many changes to make, or if you want them to persist across instances, I recommend pickling new lexicon files and changing the ENGLISH_LEXICON_PKL and COMMON_LEXICON_PKL variables in settings.py to point to them.
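
A minimal sketch of the pickling step follows. It assumes that a lexicon file is simply a pickled Python set of lowercase strings, which this notebook does not confirm, so check how the existing files are loaded before relying on it. The file name and the example words are hypothetical.

```python
import os
import pickle
import tempfile

# Assumption: the lexicon file is a pickled Python set of lowercase strings.
custom_lexicon = {"telegraph", "locomotive", "goofygooberrock"}

# Hypothetical output location; in practice choose a path inside your project.
out_path = os.path.join(tempfile.mkdtemp(), "custom_english_lexicon.pkl")

with open(out_path, "wb") as f:
    pickle.dump(custom_lexicon, f)

# Round-trip check: load the file back and confirm nothing was lost.
with open(out_path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == custom_lexicon)
```

To extend the shipped lexicon rather than replace it, you could start from a copy such as set(model.english_lexicon), add your words, and pickle the result.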

<br/>

## Compound-Splitting
One of the pre-processing checks involves recursively checking whether substrings of a word are valid English words. There are a few constraints on this splitting. First, there is a 2-character 'buffer' at each end of the word, to prevent common prefixes and endings from producing false positives. Second, the average length of the split words must be at least 4 characters, for the same reason.

However, because short words like 'er' and 'an' often lead to false positives, we have removed them from the common lexicon used to identify splittings. This can cause issues, as joined-together words such as 'at the'->'atthe' will not be recognized and split. In the seq2seqocr.py file, we have defined a function, populate_compound_memory(), that populates the memoized dictionary with several of these common short-word compounds. Before the model tries to split a word, it always checks the memoized dictionary to see whether a splitting has already been recorded, and returns the entry if it finds one.
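
To make the two splitting checks concrete, here is a rough sketch. It is not the code in seq2seqocr.py. The function name split_compound, its parameters, and the reading of the 'buffer' as "no split point within 2 characters of either end" are all assumptions made for illustration.

```python
def split_compound(word, lexicon, buffer=2, min_avg_len=4):
    """Return a list of lexicon words that join to form `word`, or None.

    Hypothetical sketch of the two checks described above.
    """
    def helper(rest):
        # A remainder that is itself a known word ends the recursion.
        if rest in lexicon:
            return [rest]
        # Only try split points at least `buffer` characters from each end.
        for i in range(buffer, len(rest) - buffer + 1):
            head = rest[:i]
            if head in lexicon:
                tail = helper(rest[i:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = helper(word)
    if parts is None or len(parts) < 2:
        return None
    # Reject splits whose pieces are too short on average.
    if sum(len(p) for p in parts) / len(parts) < min_avg_len:
        return None
    return parts


lexicon = {"news", "paper", "at", "the"}
print(split_compound("newspaper", lexicon))
print(split_compound("atthe", lexicon))
```

In this sketch 'atthe' is rejected by the average-length rule even though both halves are in the toy lexicon, which is the kind of case the memoized dictionary is meant to cover.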

In short, you can define your own mappings that you want the program to correct in pre-processing. To do so, add entries in populate_compound_memory() using the provided format, or add them to the dict manually as follows:

In [5]:
HARRY_POTTER_REFERENCE = 'youreawizardharry'

print("Model preprocessing before adding test phrase: ", model.preprocess(HARRY_POTTER_REFERENCE))

model.memoized_words[HARRY_POTTER_REFERENCE] = "im a what?"

print("Model preprocessing after adding test phrase: ", model.preprocess(HARRY_POTTER_REFERENCE))

Model preprocessing before adding test phrase:  youreawizardharry
Model preprocessing after adding test phrase:  im a what?


Note that this does not work if the 'key' in the mapping is a valid English word, because that case is handled before the memoized dictionary is checked. For instance,

In [6]:
MARCO_POLO_REFERENCE = 'marco'

print("Test phrase recognized as valid English word? ", MARCO_POLO_REFERENCE in model.english_lexicon)

print("Model preprocessing before adding test phrase: ", model.preprocess(MARCO_POLO_REFERENCE))

model.memoized_words[MARCO_POLO_REFERENCE] = 'polo'

print("Model preprocessing after adding test phrase: ", model.preprocess(MARCO_POLO_REFERENCE))

Test phrase recognized as valid English word?  True
Model preprocessing before adding test phrase:  marco
Model preprocessing after adding test phrase:  marco
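
The lookup order behind this result can be pictured with the sketch below. The lookup function is hypothetical and only mirrors the order described above (lexicon first, then the memoized dictionary). It is not the model's actual preprocess() code.

```python
def lookup(word, lexicon, memo):
    """Hypothetical lookup order; not the project's actual code."""
    # 1. Valid English words are returned untouched.
    if word in lexicon:
        return word
    # 2. Otherwise a memoized mapping, if one exists, is used.
    if word in memo:
        return memo[word]
    # 3. Compound splitting would happen here; omitted in this sketch.
    return word


lexicon = {"marco"}
memo = {"marco": "polo", "youreawizardharry": "im a what?"}

print(lookup("marco", lexicon, memo))              # lexicon wins over memo
print(lookup("youreawizardharry", lexicon, memo))  # memo entry is used
```

In this sketch, removing 'marco' from the lexicon first would let the memoized entry take effect. The real model presumably behaves the same way, but that is worth verifying.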


Also remember that all text is converted to lowercase and stripped of punctuation during pre-processing, so a key that contains uppercase letters or punctuation will never match.

In [7]:
PUNCTUAL_REFERENCE = 'LOL!!!'

print("Model preprocessing before adding test phrase: ", model.preprocess(PUNCTUAL_REFERENCE))

model.memoized_words[PUNCTUAL_REFERENCE] = 'laughing out loud exclamation mark x3'

print("Model preprocessing after adding test phrase: ", model.preprocess(PUNCTUAL_REFERENCE))

Model preprocessing before adding test phrase:  lol
Model preprocessing after adding test phrase:  lol
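
One way to avoid this pitfall is to normalize keys before storing them. The helper below is a hypothetical sketch: it assumes the model's normalization amounts to lowercasing and removing ASCII punctuation, which may not match preprocess() exactly.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_key(text):
    """Lowercase and strip ASCII punctuation (assumed normalization)."""
    return text.lower().translate(_STRIP_PUNCT)


memo = {}  # stand-in for model.memoized_words
memo[normalize_key("LOL!!!")] = "laughing out loud"

print(normalize_key("LOL!!!"))
print(memo)
```

A key stored this way can only take effect if the normalized form is not itself in the English lexicon.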
