# Testing Standard NLP Tasks on the Health Reformer (HR) from 1866

After correcting the OCR for HR18660801-V01-01, I am testing how different NLP tasks and libraries perform on the manually corrected and the OCR-generated texts. At the time of this work, I had not verified the manually corrected text, so transcription errors likely remain. However, all layout errors and major recognition errors have been addressed.

For this test, I am going to run through a few standard NLP tasks:

- POS tagging
- Entity tagging
- Tokenization:
    - sentences
    - words
    - phrases
- Lemmatization

I also hope to test with a language model such as BERT.


The first library to test is spaCy, using its large English model (`en_core_web_lg`).

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_lg")

Load the corrected texts from the data directory

In [49]:
import os

data_dir = "/Users/jeriwieringa/Documents/Research/ocr-and-nlp/data/text/"

files_list = os.listdir(os.path.join(data_dir, "manually-corrected"))

documents = {}

# Keep only the pages from the 1866 issues, keyed by filename.
for filename in sorted(files_list):
    if filename.startswith("HR1866"):
        # print(filename)
        with open(os.path.join(data_dir, "manually-corrected", filename), "r") as f:
            content = f.read()
        documents[filename] = content

In [50]:
documents['HR18660801-V01-01-p1.txt']

'# THE Health Reformer. \nOUR PHYSICIAN, NATURE: OBEY AND LIVE.\nVOL. 1. BATTLE CREEK, MICH., AUGUST, 1866. NO. 1. \nTHE HEALTH REFORMER, \nPUBLISHED MONTHLY AT\nThe Western Health-Reform Institute\nBATTLE CREEK, MICH.,\nH. S. LAY, M.D., EDITOR\nTerms: One Dollar per Year, invariably in Advance.\nAddress Dr. H. S. LAY, Battle Creek, Michigan.\n\n## Original Articles.\n\n## DIGESTION.\n## BY J. H. GINLEY, M.D.\n\nDigestion is that process by which\nfood is reduced to a form in which it can\nbe absorbed and taken up into the blood.\nThis is the way that food builds up the \nwaste constantly going on in the body. \nThis process is accomplished, 1st. By the \nteeth. 2nd. By the saliva. 3d. By the \nmucous membrane. 4th. By the gastric  \njuice. 5th. By the bile; and 6th. By \nthe pancreatic juice. \n\nAs the food consists of a mixture of\nvarious kinds, having different physical \nand chemical properties, so also these \npeculiar fluids differ in kind, quality, and\nchemical action, from e

Demo with one document

In [63]:
doc = nlp(documents['HR18660801-V01-01-p1.txt'])

In [64]:
for entity in doc.ents:
    print(entity.text, entity.label_)

VOL. DATE
1 CARDINAL
BATTLE CREEK LOC
MICH GPE
AUGUST DATE
1866 DATE
1 CARDINAL
MONTHLY DATE
The Western Health-Reform Institute ORG
BATTLE CREEK GPE
MICH GPE
H. S. LAY PERSON
M.D. GPE
One Dollar MONEY
H. S. LAY PERSON
Battle Creek GPE
Michigan GPE
## DIGESTION MONEY
## MONEY
M.D.

 ORG
1st DATE
2nd DATE
3d CARDINAL
4th ORDINAL
5th ORDINAL
6th ORDINAL
three CARDINAL
four CARDINAL
two CARDINAL
œsophagus GPE


In [65]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['THE Health Reformer', 'OUR PHYSICIAN', 'NATURE', 'VOL.', 'BATTLE CREEK', 'MICH.', 'THE HEALTH REFORMER', 'PUBLISHED MONTHLY', 'The Western Health-Reform Institute\nBATTLE CREEK', ',\nH. S. LAY', 'EDITOR', 'Terms', 'One Dollar', 'Year', 'Advance', 'Address', 'Dr. H. S. LAY', 'Battle Creek', '## Original Articles', '## DIGESTION', 'J. H. GINLEY', 'M.D.', 'Digestion', 'that process', 'food', 'a form', 'it', 'the blood', 'the way', 'food', 'the \nwaste', 'the body', 'This process', 'the \nteeth', 'the saliva', '3d', 'the \nmucous membrane', 'the gastric  \njuice', 'the bile', '6th', 'the pancreatic juice', 'the food', 'a mixture', 'various kinds', 'different physical \nand chemical properties', 'these \npeculiar fluids', 'kind', 'quality', 'chemical action', 'the food', 'the intestines', 'those parts', 'it', 'a liquid \nstate', 'absorption', 'the blood', 'the remaining portion', 'the intestinal \nsecretions', 'a firmer consistency', 'the absorption', 'the fluids', 'detrite 
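
Several of the noun phrases above contain embedded line breaks (for example `'the \nwaste'`), because the source text keeps the line breaks of the printed column. One possible preprocessing step, which I did not apply in this test, is to join hard-wrapped lines before parsing. The sketch below assumes that blank lines mark paragraph breaks and that single newlines are only layout; `unwrap_lines` is a hypothetical helper name. It would also merge the masthead and heading lines, so it is only a starting point.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Assumption: a blank line (possibly containing spaces) separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    unwrapped = []
    for paragraph in paragraphs:
        # Collapse every run of whitespace, including single newlines, to one space.
        unwrapped.append(" ".join(paragraph.split()))
    return "\n\n".join(p for p in unwrapped if p)


# Sample copied from the corrected text of page 1.
sample = (
    "This is the way that food builds up the \n"
    "waste constantly going on in the body. \n"
    "\n"
    "As the food consists of a mixture of\n"
    "various kinds"
)
unwrapped = unwrap_lines(sample)
print(unwrapped)
```

The unwrapped string could then be passed to `nlp(...)` in place of the raw page text to see whether the `SPACE` tokens and split noun chunks go away.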

In [66]:
doc_verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]

In [67]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

# # PUNCT NFP ROOT # False False
THE the DET DT det XXX True True
Health Health PROPN NNP compound Xxxxx True False
Reformer Reformer PROPN NNP pobj Xxxxx True False
. . PUNCT . punct . False False

 
 SPACE _SP  
 False False
OUR our DET PRP$ poss XXX True True
PHYSICIAN physician NOUN NN ROOT XXXX True False
, , PUNCT , punct , False False
NATURE nature NOUN NN appos XXXX True False
: : PUNCT : punct : False False
OBEY obey VERB VB appos XXXX True False
AND and CCONJ CC cc XXX True True
LIVE live VERB VB conj XXXX True False
. . PUNCT . punct . False False

 
 SPACE _SP  
 False False
VOL VOL PROPN NNP dep XXX True False
. . PROPN NNP ROOT . False False
1 1 NUM CD ROOT d False False
. . PUNCT . punct . False False
BATTLE BATTLE PROPN NNP compound XXXX True False
CREEK CREEK PROPN NNP ROOT XXXX True False
, , PUNCT , punct , False False
MICH MICH PROPN NNP dep XXXX True False
. . PROPN NNP appos . False False
, , PUNCT , punct , False False
AUGUST AUGUST PROPN NNP npadvmod XXXX True F

kind kind NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
quality quality NOUN NN conj xxxx True False
, , PUNCT , punct , False False
and and CCONJ CC cc xxx True True

 
 SPACE _SP  
 False False
chemical chemical NOUN NN compound xxxx True False
action action NOUN NN conj xxxx True False
, , PUNCT , punct , False False
from from ADP IN prep xxxx True True
each each DET DT det xxxx True True
other other ADJ JJ pobj xxxx True True
. . PUNCT . punct . False False


 

 SPACE _SP  

 False False
As as SCONJ IN mark Xx True True
the the DET DT det xxx True True
food food NOUN NN nsubj xxxx True False
passes pass VERB VBZ advcl xxxx True False
through through ADP IN prep xxxx True True
the the DET DT det xxx True True
intestines intestine NOUN NNS pobj xxxx True False

 
 SPACE _SP  
 False False
from from ADP IN prep xxxx True True
above above ADP IN advmod xxxx True True
downward downward ADV RB pobj xxxx True False
, , PUNCT , punct , False False
those those DET DT det xxx

given give VERB VBN advcl xxxx True False
a a DET DT det x True True
brief brief ADJ JJ amod xxxx True False
description description NOUN NN dobj xxxx True False

 
 SPACE _SP  
 False False
of of ADP IN prep xx True True
the the DET DT det xxx True True
digestive digestive ADJ JJ amod xxxx True False
organs organ NOUN NNS pobj xxxx True False
, , PUNCT , punct , False False
in in ADP IN prep xx True True
order order NOUN NN pobj xxxx True False
to to PART TO aux xx True True
better well ADV RBR advmod xxxx True False

 
 SPACE _SP  
 False False
understand understand VERB VB acl xxxx True False
their -PRON- DET PRP$ poss xxxx True True
uses use NOUN NNS dobj xxxx True False
, , PUNCT , punct , False False
let let VERB VB ROOT xxx True False
us -PRON- PRON PRP nsubj xx True True
glance glance VERB VB ccomp xxxx True False
at at ADP IN prep xx True True
the the DET DT det xxx True True

 
 SPACE _SP  
 False False
very very ADV RB advmod xxxx True True
interesting interesting ADJ JJ amo

In [68]:
from spacy import displacy

In [70]:
# Render a dependency parse for each sentence.
for sent in doc.sents:
    displacy.render(sent, style="dep")

Comparison to the OCR version of the document

In [55]:
files_list = os.listdir(os.path.join(data_dir, "ocr-generated"))

orig_documents = {}

# Same filter as above, applied to the uncorrected OCR output.
for filename in sorted(files_list):
    if filename.startswith("HR1866"):
        # print(filename)
        with open(os.path.join(data_dir, "ocr-generated", filename), "r") as f:
            content = f.read()
        orig_documents[filename] = content
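
The two loading cells differ only in the subdirectory name, so they could share one helper. A minimal sketch, assuming the files are UTF-8 encoded; `load_documents` is a hypothetical name, and the demo writes made-up files to a temporary directory so that it runs without my local data.

```python
import tempfile
from pathlib import Path


def load_documents(directory, prefix="HR1866"):
    """Return {filename: text} for files in `directory` whose names start with `prefix`."""
    documents = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            # Assumption: the text files are UTF-8 encoded.
            documents[path.name] = path.read_text(encoding="utf-8")
    return documents


# Self-contained demo; the file contents here are placeholders, not real data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "HR18660801-V01-01-p1.txt").write_text("THE Health Reformer.", encoding="utf-8")
    Path(tmp, "README.txt").write_text("not a page", encoding="utf-8")
    demo = load_documents(tmp)

print(sorted(demo))
```

With the real data this would be called as `load_documents(os.path.join(data_dir, "manually-corrected"))` and `load_documents(os.path.join(data_dir, "ocr-generated"))`.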

In [71]:
doc = nlp(orig_documents['HR18660801-V01-01-p1.txt'])

In [72]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Jcformcr ORG
1 CARDINAL
BATTLE CREEK LOC
MICH GPE
AUGUST DATE
11166 DATE
1 CARDINAL
H. S ORG
~.:I NORP
ED PERSON
One CARDINAL
H. 8 PERSON
LA Y1 llattle Creak ORG
l1avc four ORG
two CARDINAL
mimals PERSON
The alilllentary canal ORG
municate PERSON
ophagus PERSON
1~1 CARDINAL
1nto CARDINAL
ilds PERSON
ces · PERSON
hed ORG
1st DATE
2nd DATE
~~cous CARDINAL
4t.h CARDINAL
5th ORDINAL
6th ORDINAL
JUice ORG
4s CARDINAL
vanous kmds PERSON
jt PRODUCT
inte~-larger PRODUCT
ltrong muscular PERSON
togethei NORP
acquireR GPE
detrite materi PERSON
al PERSON
venuiculat PERSON
three CARDINAL
1 CARDINAL
Ii Ii II Ii PRODUCT
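
To go beyond eyeballing the two entity lists, one option is to tally the labels and intersect the entities for each version. A minimal sketch, assuming the entities have been collected as `(text, label)` pairs, for example with `[(e.text, e.label_) for e in doc.ents]`. `label_counts` and `shared_entities` are hypothetical helper names, and the sample pairs are a hand-copied subset of the two entity outputs for page 1.

```python
from collections import Counter


def label_counts(entities):
    """Count how often each entity label occurs in a list of (text, label) pairs."""
    return Counter(label for _, label in entities)


def shared_entities(corrected, ocr):
    """Return the (text, label) pairs that appear in both versions, sorted."""
    return sorted(set(corrected) & set(ocr))


# Hand-copied subset of the entity output for the two versions of page 1.
corrected_ents = [("BATTLE CREEK", "LOC"), ("MICH", "GPE"), ("AUGUST", "DATE"),
                  ("1866", "DATE"), ("H. S. LAY", "PERSON"), ("One Dollar", "MONEY")]
ocr_ents = [("Jcformcr", "ORG"), ("BATTLE CREEK", "LOC"), ("MICH", "GPE"),
            ("AUGUST", "DATE"), ("11166", "DATE"), ("H. S", "ORG"), ("One", "CARDINAL")]

print(label_counts(corrected_ents))
print(label_counts(ocr_ents))
print(shared_entities(corrected_ents, ocr_ents))
```

Exact matching on text and label is strict on purpose: a recognition error such as `11166` for `1866` counts as a miss even though the label is still `DATE`.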


In [73]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['THE Jcformcr', 'Ł', 'OUR PHYSiC1AN', 'NATURE', '_', '_', '_', 'VOL', 'BATTLE CREEK', 'MICH.', 'THE HEALTH REFORMER', 'PUl!ILISUED', 'Mff', '~TllLY AT', 'mut', 'Jra1tu-~rtarm ~nstitutc, RATTJ.K CRi,:EK, iHICH., H. S', '_ LAY1', 'l., ED', 'I', 'TOE', 'One Dnlh', "r per Ye1u1 tnTa.l'iRhly", 'Advance', 'Dr. H.', 'LA Y1 llattle Creak', '%', 'higher order', 'man', 'the head', 'minrn', 'l creation', 'a great \nsimilarity', 'the animal \ndigestive system', 'man', 'the human species', 'the digestive apparatus', 'it', 'The alilllentary canal', "ditl'ernut cavities", 'coru-\n(0 ti 9 inn l', 'municate', '~................................................................ openings', 'DIGESTION', 'its commencement', 'we', 'the cav-', 'the mouth', 'its posterior 11Y J.', 'Gc11LH', 'P. extremity', 'a muscular valve', 'isthmus', 'the \nfauces', 'the ces-', '~ess ~y ~h1ch, ophagus', 'it', 'tho \nfood', 'a form 1~1 wlnch 1t', 'either extremity', 'the blood', 'circular folds', 'muscular fibr

In [74]:
orig_doc_verbs = [token.lemma_ for token in doc if token.pos_ == "VERB"]

In [75]:
# Render a dependency parse for each sentence of the OCR version.
for sent in doc.sents:
    displacy.render(sent, style="dep")

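Let's compare the sets of verb lemmas identified in the two versions. The first list in the result holds lemmas found only in the corrected text; the second holds lemmas found only in the OCR text.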

In [60]:
# https://www.geeksforgeeks.org/python-difference-two-lists/
def Diff(li1, li2):
    """Return (items only in li1, items only in li2), ignoring duplicates and order."""
    return list(set(li1) - set(li2)), list(set(li2) - set(li1))

In [61]:
Diff(doc_verbs, orig_doc_verbs)

(['consist',
  'acquire',
  'ascend',
  'build',
  'accomplish',
  'lodge',
  'communicate',
  'publish',
  'divide'],
 ['commumcate',
  'asccn1le',
  'e',
  'rent',
  'diviue',
  'hed',
  'be',
  "t'i",
  'ease',
  'waste'])
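
Some of the OCR-only lemmas are clearly garbled forms of the corrected-only ones. One way to pair them up is approximate string matching with the standard library's `difflib`. This is a sketch rather than part of the original test: `closest_match` is a hypothetical helper, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib

# The two lists returned by Diff(doc_verbs, orig_doc_verbs) above.
corrected_only = ["consist", "acquire", "ascend", "build", "accomplish",
                  "lodge", "communicate", "publish", "divide"]
ocr_only = ["commumcate", "asccn1le", "e", "rent", "diviue",
            "hed", "be", "t'i", "ease", "waste"]


def closest_match(word, candidates, cutoff=0.8):
    """Return the candidate most similar to `word`, or None if none reaches `cutoff`."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Map each OCR-only lemma to its best guess among the corrected-only lemmas.
pairs = {word: closest_match(word, corrected_only) for word in ocr_only}
for ocr_word, guess in pairs.items():
    print(f"{ocr_word!r} -> {guess!r}")
```

Lightly damaged forms such as `commumcate` pair with their corrected lemma at this cutoff, while heavily garbled ones such as `asccn1le` may not pair with anything.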

One of the strategies for dealing with messy OCR is to select only the nouns for tasks such as topic modeling.
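
A minimal sketch of that noun-only selection, written against the token attributes used above (`pos_`, `lemma_`, `is_alpha`). `noun_lemmas` is a hypothetical helper name, and dropping non-alphabetic tokens is my own assumption about what counts as noise. The stand-in tokens let the sketch run without loading a model; with spaCy, the function would be called on `nlp(text)` instead.

```python
from types import SimpleNamespace


def noun_lemmas(doc, pos_tags=("NOUN",)):
    """Return lowercased lemmas of alphabetic tokens whose coarse POS tag is in `pos_tags`."""
    return [
        token.lemma_.lower()
        for token in doc
        # Assumption: tokens containing digits or symbols are treated as OCR noise.
        if token.pos_ in pos_tags and token.is_alpha
    ]


# Stand-in tokens that mimic the spaCy attributes used by the function.
fake_doc = [
    SimpleNamespace(lemma_="the", pos_="DET", is_alpha=True),
    SimpleNamespace(lemma_="intestine", pos_="NOUN", is_alpha=True),
    SimpleNamespace(lemma_="pass", pos_="VERB", is_alpha=True),
    SimpleNamespace(lemma_="1~1", pos_="NOUN", is_alpha=False),
]
print(noun_lemmas(fake_doc))
```

Passing `pos_tags=("NOUN", "PROPN")` would also keep proper nouns. Note that this filter relies on the tagger's output, so garbled words that are tagged as nouns and happen to be alphabetic would still get through.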