OCR'd texts pose special challenges for tokenization. Consider this selection from an OCR'd version of Darwin's Origin of Species from the [Internet Archive](https://archive.org/download/originofspecies00darwuoft/originofspecies00darwuoft_djvu.txt):

```
the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may
```

Here the printing convention of line-break hyphenation would, under a standard tokenizer, generate incorrect tokens like `interest-ing` (or perhaps `interest-` and `ing`). Design a better tokenizer (even one that only adds pre- and post-processing) for these texts. Note that the correct tokenization of `interest-ing` is `interesting`, while the correct tokenization of `newly-formed` is still `newly-formed`.

For a more thorough library for handling OCR'd book data, see [DataMunging](https://github.com/tedunderwood/DataMunging).


In [1]:
import re

import nltk

In [8]:
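def read_text(filename):
    # Read a file into a list of lines, stripping trailing whitespace
    # (including the newline) from each line.
    lines=[]
    with open(filename) as file:
        for line in file:
            lines.append(line.rstrip())
    return lines

def read_tex_2(filename):
    # Same behavior as read_text, written as a list comprehension.
    with open(filename) as file:
        lines = [line.rstrip() for line in file]
    return lines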

In [9]:
filename="../data/darwin_origin_ia.txt"

In [11]:
lines=read_tex_2(filename)

In [20]:
testText="""the inhabitants of the surrounding districts will, also, be thus
prevented. Moritz Wagner has lately published an interest-
ing essay on this subject, and has shown that the service
rendered by isolation in preventing crosses between newly-
formed varieties is probably greater even than I supposed.
But from reasons already assigned I can by no means agree
with this naturalist, that migration and isolation are neces-
sary elements for the formation of new species. The im-
portance of isolation is likewise great in preventing, after
any physical change in the conditions such as of climate ele-
vation of the land, &c., the immigration of better adapted or-
ganisms; and thus new places in the natural economy of the
district will be left open to be filled up by the modification of
the old inhabitants. Lastly, isolation will give time for a
new variety to be improved at a slow rate ; and this may"""

In [16]:
# English word list (the Brown corpus serves as the dictionary)
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     /Users/pierrejaumier/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [17]:
"interest-ing" in brown.words()

False

In [18]:
"natural" in brown.words()

True

In [22]:
# Regular expression to find hyphens at the end of a line https://regexr.com/
import re
pattern = re.compile(r'\S*-\n\S*')
items = pattern.findall(testText)
items

['interest-\ning',
 'newly-\nformed',
 'neces-\nsary',
 'im-\nportance',
 'ele-\nvation',
 'or-\nganisms;']
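The match for `or-\nganisms;` above carries its trailing semicolon because `\S*` accepts any non-space character. A minimal sketch of a stricter pattern follows, assuming that matching only word characters (`\w+`) on each side is acceptable for this text; the `sample` string and the names `split_word` and `pairs` are illustrative and not part of the original notebook.

```python
import re

# Short sample with the same line-break hyphens as the passage above.
sample = "an interest-\ning essay; newly-\nformed varieties; or-\nganisms; and"

# (\w+) captures word characters only, so the ';' after "ganisms"
# stays outside the match. The two groups give the halves directly.
split_word = re.compile(r'(\w+)-\n(\w+)')
pairs = split_word.findall(sample)
print(pairs)
```

With two capturing groups, `findall` returns `(left, right)` tuples, so no later `split('-\n')` is needed.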

In [63]:
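# Build the vocabulary once: a set gives constant-time membership tests,
# whereas `x in brown.words()` rescans the whole corpus on every test.
vocab = set(brown.words())

tokenized_words = []
for item in items:
    left_word, right_word = item.split('-\n')
    if left_word in vocab and right_word in vocab:
        # both halves are words in their own right: keep the hyphen
        tokenized_word = left_word + '-' + right_word
    else:
        # at least one half is only a fragment: join without the hyphen
        tokenized_word = left_word + right_word
    tokenized_words.append(tokenized_word)

tokenized_words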

['interesting',
 'newly-formed',
 'necessary',
 'importance',
 'elevation',
 'organisms;']
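The steps above can be folded into a single pre-processing pass that repairs the text in place before any tokenizer sees it. This is a minimal sketch, not a definitive implementation: `dehyphenate`, `toy_vocab`, and `sample` are hypothetical names, and the tiny hand-written vocabulary only stands in for a real word list (in this notebook, a set built from `brown.words()` would play that role). It applies the same rule as the cell above, keeping the hyphen only when both halves are known words.

```python
import re

def dehyphenate(text, vocab):
    """Rejoin words that were split across a line break by a hyphen."""
    def join(match):
        left, right = match.group(1), match.group(2)
        if left in vocab and right in vocab:
            # both halves are known words: treat as a true compound
            return left + '-' + right
        # otherwise assume a printing hyphen and glue the halves together
        return left + right
    return re.sub(r'(\w+)-\n(\w+)', join, text)

# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"an", "interest", "essay", "on", "newly", "formed", "varieties"}
sample = "an interest-\ning essay on newly-\nformed varieties"

cleaned = dehyphenate(sample, toy_vocab)
tokens = cleaned.split()
print(tokens)
```

Passing the vocabulary as a parameter keeps the function independent of any particular corpus. The rule still misfires when a fragment happens to be a word on its own, so checking whether the joined form is in the vocabulary may be a useful additional test.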