# Report on spell-check quality and possibilities for future improvements

## Code to generate the reports


In [10]:
import pandas as pd
pd.options.display.max_rows = 10000
import os

In [2]:
dir_ = '/Users/jeriwieringa/Dissertation/drafts/data/spelling-statistics/round4/'

In [3]:
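def results_to_df(title):
    """Load the spelling-error report for `title` and add a word-length column.

    If several report files match, each is announced and the last one
    (in sorted filename order) is returned.
    """
    df = None
    for filename in sorted(os.listdir(dir_)):
        if filename.endswith("{}.txt".format(title)):
            print("File: {}".format(filename))
            df = pd.read_csv(os.path.join(dir_, filename))
            df['word_length'] = df['spell_error'].str.len()
            print("{} has {} rows".format(title, len(df.index)))
    if df is None:
        raise FileNotFoundError("No report found for {} in {}".format(title, dir_))
    return df

def query_df(df, count_min, length_min, sort_by):
    """Keep errors with count > count_min and word_length > length_min, sorted descending."""
    return (df.query('count > {} & word_length > {}'.format(count_min, length_min))
              .sort_values(sort_by, ascending=False))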

## General Observations

- Hyphenated words are frequently compressed into a single token. This turned out to be a regex problem created during cleaning. I will need to verify whether it persists after running on the corpus without "preliminary cleaning."
- Some corpus-specific vocabulary and regex needs.

# ADV (The Advocate)

## General Information

- Date range: 1898 - 1905
- Publication cycle: Monthly
- Layout notes: Early issues are single column. Moves to a two column layout in 1900.
- Publisher: Battle Creek College
- Topics: Education, Missions

Due to the layout of this particular title, there is a high prevalence of what I refer to as "split words," where a word is split across two tokens because the OCR engine did not recognize the hyphen at the line break. For this particular title, I will take additional cleaning steps to match the most common orphaned endings with their preceding tokens.

## Spell Error Report

In [4]:
# Errors drawn from round3 list

query_df(results_to_df('ADV'), 4, 2, "count")

File: 2016-12-08-Spelling-Errors-ADV.txt
ADV has 17407 rows


Unnamed: 0,spell_error,count,word_length
6805,tion,807,4
345,dren,329,4
8406,chil,326,4
4459,educa,323,5
13967,ment,304,4
8620,n't,234,3
5858,ers,208,3
3554,tions,202,5
8709,edu,175,3
10179,pre,174,3


## Next Steps

The biggest challenge with the Advocate is the failure of the OCR to pick up hyphens at line endings. This occurs in both formats of the periodical.


### Words to add to supplemental list

- moines
- god’s
- springdale
- postoffice
- teachers’
- heafford
- boggstown
- arithmetics
- avondale
- cowles
- kinley
- lord’s
- campmeeting
- christ’s
- melancthon
- sanitas
- yawman
- duncombe
- children’s
- people’s
- smithfield
- juniata
- publishers’
- upbuilding
- teacher’s
- missoula
- colporteurs
- holyoke
- martius
- augsbourger
- nature’s
- wittemberg
- johanne
- child’s
- bozeman
- year’s
- abraham’s
- instrumentalities
- eastport
- mother’s
- joseph’s
- preceptress
- gifford
- nurses’
- marilla
- leadings
- cloverdale
- montavilla
- gravelford
- edgewood
- melanchthon


### Title specific errors to correct

Initial thought: To fix as many of the split words as possible, I will take the most common orphaned endings and check whether a word can be created by combining each with the preceding token. The script to fix the errors would find the orphaned ending, identify the preceding token, combine the two tokens, check whether the resulting token is in the dictionary, and, if so, replace the two tokens with the new token.

A less manual approach would be to perform this check for every token that is not in the dictionary ...
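The procedure described above could be sketched roughly as follows. This is a minimal illustration rather than the script used on the corpus: `rejoin_split_words`, the `orphans` set, and the tiny `vocabulary` are hypothetical names and data, and the tokens are assumed to be lowercased already.

```python
def rejoin_split_words(tokens, orphans, vocabulary):
    """Merge an orphaned ending with the token before it when the result is a word.

    tokens: list of lowercased tokens in document order
    orphans: set of common orphaned endings (e.g. "tion", "dren")
    vocabulary: set of known-good words
    """
    repaired = []
    for token in tokens:
        if repaired and token in orphans:
            candidate = repaired[-1] + token
            if candidate in vocabulary:
                # Replace the two fragments with the combined word.
                repaired[-1] = candidate
                continue
        repaired.append(token)
    return repaired


# Hypothetical example data, using fragments from the ADV report above.
vocabulary = {"education", "children", "the", "of"}
orphans = {"tion", "dren"}
tokens = ["the", "educa", "tion", "of", "chil", "dren"]
print(rejoin_split_words(tokens, orphans, vocabulary))
```

The less manual variant would drop the `orphans` check and attempt the merge for every token that is not in the dictionary, at the cost of more lookups and some risk of false joins.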



# AmSn (American Sentinel | The Sentinel of Liberty)

## General Information

- Date range: 1886-1900
- Publication cycle: Begins monthly. Switches to weekly in 1889.
- Layout notes: Begins with three column layout. Switches to two columns in 1897.
- Publisher: Pacific Press Publishing Company | International Religious Liberty Association
- Topics: Sabbath, Legal, Religious Liberty



In [5]:
# Errors drawn from round3 list

query_df(results_to_df('AmSn'), 4, 2, "count")

File: 2016-12-08-Spelling-Errors-AmSn.txt
AmSn has 61743 rows


Unnamed: 0,spell_error,count,word_length
22881,n't,2138,3
20525,'the,431,4
5986,tion,317,4
12689,seventhday,258,10
30463,indorsed,250,8
43697,'of,232,3
30788,satolli,230,7
14935,employes,208,8
57591,munn,206,4
7727,'ll,200,3


### Strategies to Pursue


### Words to add to supplemental list

- deseret
- indorses
- magna
- indorsing
- intrusted
- attaches
- endeavorer
- palmeter
- indorse
- indorsed
- kinley
- indorsements
- indorsement
- endeavorers
- romanists
- unsectarian
- unscriptural
- covenanters
- vaticanism
- theocratical
- artotypes
- habeas
- christianized
- intermeddling
- jesuitical
- memorialists
- despotisms
- puritanic
- intrenched
- christianizing
- mohammed
- amendmentists
- sarnia
- paulist
- phariseeism
- saloonists
- bourgois
- colporter
- sanhedrim
- spurgeon
- romeward
- ascendency
- maximus
- sabbatarianism
- sabbathism
- venders
- moslems
- powderly
- legislationists
- desecrators
- usurpations
- instrumentalities
- sabbatic
- reconcentrados
- churchism
- incapacitations
- colporters
- imprimerie
- theocrats
- groundlessness
- pretentions
- universalists
- melancthon
- chalcedon
- brownist
- officio
- adventism
- ammonius
- merchandize
- zaragoza
- apostolical
- melchisedek
- christless
- clamorers
- unsanctified
- sabbathless
- advertized
- constantius
- supposable
- salvationists
- churchianity
- assumptionist
- martyrdoms
- saturdarians
- desecrations
- mahometan
- reductio
- sanitas
- saloonist
- hibernians
- whitehouse
- torrey
- absurdum
- seraphim
- obiter
- episcopus
- showeth
- unfallen
- charlestown
- plottings
- martialed
- paganized
- peoples'
- memoranda
- discriminations
- andover
- alleghany
- prohibitum
- mahometans
- controversialists
- beneficient
- mussulmans
- judaizing
- scripta
- liberalists
- worldlings
- embassadors
- monsignori
- buddhistic
- archambault
- intoleration
- supremest
- donatists
- republica
- mortem
- papistic
- judaistic
- dunkards
- unbiblical
- severities
- sophisms
- saturdarian
- schuette
- domine
- sigourney
- ultramontanes
- mahometanism
- vanderbilts
- pomeroy
- shangti
- coronate
- russellville
- cocoanut
- comprehendeth
- embroglio
- ballantine
- ansonia
- silverman
- ames
- leadville
- eleusinian
- unfrequently
- indulgencies
- mormondom
- witherspoon
- honoreth
- childeric
- peligious
- wagonettes
- legum
- brahmins
- sabbathlessness
- maclean
- unbloody
- deritend
- holyoke
- hillsboro
- dauchy
- medicaments
- pedobaptist
- jesus'
- colporteurs
- hominem
- majestrates
- cortege
- worshipeth
- ottowa
- religionis
- dauphiny












### Title specific errors to correct

# ARAI (Advent Review Anniversary Issues)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [20]:
query_df(results_to_df('ARAI'), 4, 2, "count")

File: 2016-12-08-Spelling-Errors-ARAI.txt
ARAI has 507 rows


Unnamed: 0,spell_error,count,word_length
162,rockyhill,9,9
467,stowell,8,7
226,k'o,7,3
455,cheo,7,4
341,parana,6,6
453,friedenstal,6,11
476,sha,6,3
119,chitonga,5,8
295,mch,5,3
409,nyassa,5,6


## Report

# CE (Christian Education)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [19]:
query_df(results_to_df('CE'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-CE.txt
CE has 8808 rows


Unnamed: 0,spell_error,count,word_length
8780,n't,727,3
195,'ll,162,3
7207,manumental,133,10
1408,kibbin,56,6
7387,adelphian,52,9
236,'ve,52,3
6995,tion,45,4
2986,millis,44,6
4171,tis,43,3
5404,'re,42,3


## Report

There are a number of contraction endings that have been split off into their own tokens (727 for "n't"). This appears to be a feature of the default NLTK tokenizer, which splits on punctuation. The benefit is that the tokenizer takes care of end-of-line punctuation. To avoid reporting the contraction endings as errors, I am adding them to the SDA word list, even though this could mask some OCR noise.
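To get a sense of how much of the reported error count comes from these endings, something like the following could be used. It is only a sketch: `drop_contraction_endings` is a hypothetical helper, and the counts are copied from the CE table above rather than read from the report file.

```python
from collections import Counter

# Endings the tokenizer splits off from contractions (taken from the CE report).
CONTRACTION_ENDINGS = {"n't", "'ll", "'ve", "'re"}


def drop_contraction_endings(error_counts, endings=CONTRACTION_ENDINGS):
    """Return a copy of the error counts without the contraction endings."""
    return Counter({tok: n for tok, n in error_counts.items() if tok not in endings})


# Counts copied from the CE table above.
errors = Counter({"n't": 727, "'ll": 162, "manumental": 133, "'ve": 52, "tion": 45})
remaining = drop_contraction_endings(errors)
removed = sum(errors.values()) - sum(remaining.values())
print(removed, "occurrences were contraction endings")
```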


### Words for SDA Vocabulary List

- n't
- 'll
- 've
- 're
- tis
- manumental
- adelphian
- nurses'
- wyclif
- maplewood
- lippincott
- preceptresses
- syllabi
- hillcrest
- nonessentials
- workers'
- impartation
- acquirements
- arousement
- godlikeness
- pedler
- fangled
- crayolas
- buena
- heartedly
- reenforced
- fomentations
- kindergartner
- literatures
- memoriam
- unpedagogical
- homiletical
- teachers'
- 'twould
- sidewise
- wheatless
- postum
- excellences
- untechnical
- kindergartners
- inclosure
- curriculums
- learnable
- pyrographic
- manumentals
- unruled
- postals
- reenforcement

# CUV (Columbia Union Visitor)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [8]:
query_df(results_to_df('CUV'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-CUV.txt
CUV has 62824 rows


Unnamed: 0,spell_error,count,word_length
13919,ppf,2144,3
25398,'the,510,4
16367,brownlee,459,8
20052,chas,446,4
10106,sabbathschool,362,13
20788,luzerne,361,7
44554,'of,332,3
13057,seventhday,324,10
46923,reichenbach,312,11
2780,elphatrick,307,10


## Report

'PPF' is an abbreviation for "Past, Present, and Future" by James Edson White.

Whoever edited this particular periodical was a wretched speller. As there are a number of common spelling errors, it may be useful to use pyenchant on this particular title to correct the obvious misspellings, so that the data from here matches the other periodicals. Without that, the OCR error rate on this title will be artificially high, and it will be difficult to connect topics here with other periodicals.
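As a rough illustration of that kind of correction pass, the sketch below uses the standard library's `difflib` as a stand-in for pyenchant (it is not pyenchant's API). The `correct_spelling` helper, the cutoff of 0.85, and the three-word vocabulary are all assumptions made for the example.

```python
import difflib


def correct_spelling(token, vocabulary, cutoff=0.85):
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Hypothetical vocabulary; "sanitorium" appears in the CUV error list.
vocabulary = ["sanitarium", "hindrances", "visitor"]
print(correct_spelling("sanitorium", vocabulary))
```

With a full dictionary, a high cutoff matters: personal and place names such as "brownlee" should fall through unchanged rather than be pulled toward the nearest common word.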

It may also be worth doing Named Entity Extraction in an effort to further build up a database of names.
- I examined this possibility, but this is a very computationally slow process and results in a lot of noise.


### Words to add

- ppf
- yingling
- fauquier
- enroute
- roumanian
- middlesex
- culpeper
- rappahannock
- districted
- sligonian
- colportage
- urbana
- vandergrift
- smithburg
- bereans
- berean
- dunkard
- movings
- hollenbaugh
- ephrata
- backslidden
- punxsutawney
- roumanians
- waynesboro
- drummond
- williamstown
- doylestown
- nelsonville
- steubenville
- fredericktown
- sanitorium
- elmshaven
- nutmeato
- canonsburg
- ashville
- bursted
- hackettstown
- liberalities
- roanoak
- districting
- coopersburg
- rebaptized
- santo
- beavertown
- cannonsburg
- wheaton
- antitypical
- enablings
- unchristlike
- annum
- spotsylvania
- szechuan
- emporia
- burlingame
- respector
- russellism
- revelator
- summerset
- deerfield
- wapakoneta
- idlewood
- vanderhook
- clarksville
- cantwell
- meacham
- stereoptican
- anacostia
- christlikeness
- disfellowshipped
- prospectives
- bridgeville
- cloverfield
- scottsburg

# EDU (The Christian Educator)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [18]:
query_df(results_to_df('EDU'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-EDU.txt
EDU has 3028 rows


Unnamed: 0,spell_error,count,word_length
315,sloyd,29,5
2166,bamberger,13,9
2471,tion,12,4
1920,salomon,11,7
364,dep't,11,5
1125,pre,10,3
2519,abrahamson,9,10
924,pub'g,8,5
959,'the,8,4
790,ment,7,4


## Report
### Words to add

- boettger
- whatley
- vergil
- perlen
- morrill
- salomon
- sloyd
- bamberger
- don'ts
- cygnaeus
- cannel
- watrous
- anchoret
- triticum
- chamberlin
- sewall
- emeline
- gemmules
- swinton
- nephesh
- sherringham
- achroodextrin
- granose
- erythrodextrin
- leucocytes
- orthoepist
- brownell
- abrahamson
- dep't

# GCB (General Conference Session Bulletins)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [17]:
query_df(results_to_df('GCB'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-GCB.txt
GCB has 43965 rows


Unnamed: 0,spell_error,count,word_length
24302,tion,679,4
40181,gcs,436,3
38395,ence,346,4
36370,'the,338,4
42459,ference,289,7
41860,ment,240,4
37146,'of,232,3
31501,ple,186,3
24655,sabbathschool,179,13
26374,'to,166,3


## Results

It appears that this title also has some split-word problems, with common suffixes appearing in the errors list.

"GCS" is a header that appears on each page of the transcribed session bulletins. It is not an OCR error, but it is not a word either.

### Words to add

- hildebran
- unworked
- incorporations
- deutsche
- zurich
- somabula
- polyglotte
- tsungwesi
- materia
- catharina
- unreprovable
- medica
- cafes
- apostacy
- harkened
- unblamable
- barbadoes
- boggs
- concordia
- acetanilid
- baumann
- frederickson
- moffat
- arbeiter
- habitants
- breckinridge
- christofferson
- stromsburg
- malarious
- neuchatel
- pawtucket
- cocoanuts
- morphin
- modale
- embassador
- gcs
- amens


# GH (The Gospel Herald)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [11]:
query_df(results_to_df('GH'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-GH.txt
GH has 26797 rows


Unnamed: 0,spell_error,count,word_length
26162,smouse,177,6
26702,'the,153,4
24283,schramm,113,7
268,thot,107,4
16496,'of,104,3
18531,jno,99,3
22762,chas,92,4
20048,lintonia,82,8
5571,'to,75,3
3106,tion,75,4


## Report

This one also seems to have some spelling trouble, similar to the Columbia Union Visitor. Check editor?

### Words to add

- schramm
- smouse
- lintonia
- wilsonia
- spartanburg
- corsicana
- samuels
- sturgis
- connoughay
- maclaren
- westerfield
- opposers
- culpepper
- unhelped
- antediluvians
- tonsilitis
- allendale


# GOH (The Gospel of Health)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [16]:
query_df(results_to_df('GOH'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-GOH.txt
GOH has 2783 rows


Unnamed: 0,spell_error,count,word_length
1063,nuttose,51,7
1553,bromose,24,7
2597,abbie,20,5
1316,nuttolene,19,9
71,lauretta,18,8
796,protose,14,7
336,lenna,13,5
1320,mackey,12,6
1527,pel,10,3
2288,tion,10,4


## Report

### Words to add

- nuttose
- bromose
- nuttolene
- lauretta
- protose
- drs
- bouchard
- proteids
- croutons
- comfortables
- strychnin
- dulness
- albumins
- vomica
- excrementitious
- schillembeck
- hyperpepsia
- unemulsified
- albumens
- fairhaven

# GS (Gospel Sickle)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [13]:
query_df(results_to_df('GS'), 4, 2, "count")

File: 2016-12-09-Spelling-Errors-GS.txt
GS has 18810 rows


Unnamed: 0,spell_error,count,word_length
6257,'the,181,4
4423,'of,124,3
12013,aro,111,3
4173,eze,75,3
7908,'to,64,3
15169,'and,63,4
12885,ile,55,3
15862,pre,50,3
13040,ots,47,3
13239,tion,45,4


## Report

Overall, this title has a lot of errors per page, though the pages are also on average double the 1000 word goal. 

There are many errors raised by over-abundant "'" marks. This title may benefit from a closer evaluation of that OCR pattern. The scans of this title have a lot of image noise.
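One possible way to handle the stray apostrophes is sketched below, under the assumption that a leading "'" is noise whenever the rest of the token is a known word. `strip_stray_apostrophe`, the `KEEP` set, and the small `vocabulary` are hypothetical.

```python
# Tokens that legitimately begin with an apostrophe and should be left alone.
KEEP = {"'ll", "'ve", "'re", "'twould"}


def strip_stray_apostrophe(token, vocabulary, keep=KEEP):
    """Drop leading apostrophes when what remains is a known word."""
    if token.startswith("'") and token not in keep:
        remainder = token.lstrip("'")
        if remainder in vocabulary:
            return remainder
    return token


vocabulary = {"the", "of", "to", "and"}
print([strip_stray_apostrophe(t, vocabulary) for t in ["'the", "'of", "'ll", "n't"]])
```

Tokens that do not resolve to a dictionary word are left untouched, so genuine OCR noise still shows up in the error report.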

There are also a lot of split words, so it may be worth applying a similar strategy of matching errors to surrounding tokens.

"elds" is an abbreviation for "elders."

### Words to Add

- elds
- heylyn
- mosheim
- blest
- schaff
- shabbath
- northfield
- abrahamic
- unintoxicating
- overcomers
- decretals
- fulfillments
- sabbatism
- winona
- brainerd
- unmingled
- owatonna
- hackett
- prideaux
- aubigne
- uncandid
- sankey

# HM (The Home Missionary)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [14]:
query_df(results_to_df('HM'), 4, 2, "count")

File: 2016-12-12-Spelling-Errors-HM.txt
HM has 9990 rows


Unnamed: 0,spell_error,count,word_length
5175,gen'l,102,5
6604,durand,97,6
5407,rep't,92,5
435,miscel,79,6
8413,am't,76,4
2387,mis,76,3
6180,avenola,76,7
4402,l't,64,3
3674,cumb,62,4
7453,imlay,57,5


## Report

There are a number of "report" issues in this periodical with many people and place names. This will inflate the OCR error rate, both because these tend to be printed in a small and compact font, and because many of the names are missing from the word lists.

There are also a number of spelling errors in the original document (e.g., hinderances).

This title also makes liberal use of abbreviations:

- "gen'l" is an abbreviation for "general"
- "rep't" is an abbreviation for "report"
- "am't" is an abbreviation for "amount"
- "l't" is part of an abbreviation ("l't'd") for "limited" on the train schedules. (not included as this is too broad)
- "dist's" for districts
- "deliv'd" for delivered
- "susp'n" for "Susp'n Bridge" (unsure what "susp'n" is for)


### Words to Add

- gen'l
- rep't
- am't
- durand
- wheatena
- avenola
- cassopolis
- lehigh
- bogota
- bordoville
- grandville
- farmington
- grinnell
- elkhorn
- pierson
- centerville
- pierrepont
- springside
- lakeview
- castana
- scottville
- wamego
- sandyville
- saranac
- webberville
- vermontville
- sinclairville
- evangeliets
- sextonville
- ladonia
- grangeville
- sedalia
- leesburg
- harrisonville
- mannsville
- marshalltown
- winterset
- kankakee
- windham
- pataha
- demerara
- slocumville
- moravians
- ogdensburg
- vacaville
- graettinger
- richland
- oceanica
- rurutu
- nortonville
- pemberville
- gouverneur
- recanvassing
- navasota
- milbank
- maquoketa
- pendleton
- litchfield
- paulsboro
- clairsville
- antigo
- olivett
- singalese
- osawkee
- kewanna
- wilsonville
- neodesha
- bellaire
- barberville
- stambaugh
- carrollton
- delphos
- worthington
- lovington
- chitwood
- pittwood
- recanvass
- marquam
- eldorado
- spartansburg
- susanville
- texarkana
- scofield
- oronoque
- hawleyville
- buckland
- edenboro
- escanaba
- messrs
- remedios
- albina
- revell
- bloomsburg
- unevangelized
- westford
- salverda
- kennard
- heppner
- kingsbury
- matherton
- clearmont
- berbice
- norristown
- peterboro
- vryburg
- nestorians
- depositaries
- hasheesh
- salemville
- nordland
- vreeland

# HR (The Health Reformer)

## General Information

- Date range: 1866-1907\*
- Publication cycle: Monthly. Initially, July to June. Switches to calendar year for the volume in 1872.
- Layout notes: predominantly 2 columns. Number of ads increases over time.
- Publisher: Good Health Publishing Company
- Topics: 

\* J.H. Kellogg continues the publication of *The Health Reformer* after 1907, but it is no longer a denominationally sanctioned publication.

Title changes to *"Good Health: A Journal of Hygiene"* in 1879.

Run for 1886 is incomplete.

Covers missing for 1897 - 1899.

The original dates for 1890 were given incorrectly as 1899 in the file name and need to be corrected in the corresponding text files.


In [15]:
query_df(results_to_df('HR'), 4, 2, "count")

File: 2016-12-12-Spelling-Errors-HR.txt
HR has 94059 rows


Unnamed: 0,spell_error,count,word_length
28466,tion,950,4
20855,sel,633,3
72492,cafe,595,4
15620,sitz,460,4
50304,ment,445,4
31479,pre,423,3
61348,proteid,417,7
45282,hydrozone,266,9
2316,tions,265,5
76552,glycozone,250,9


## Report

- appears to have line-ending problems resulting in split words
- also seems to have words running together


### Words to Add

- cafe
- proteid
- hydrozone
- glycozone
- kumyss
- sirup
- tremens
- hypopepsia
- trall
- microscopists
- hygeio
- hydriatic
- gruels
- infantum
- electropoise
- meltose
- derangements
- pharmacal
- drexel
- dextrinized
- caffein
- morbus
- corpore
- enemata
- alabastine
- chautauquan
- sanitaire
- depurating
- innutritious
- sirups
- hygeian
- ptomains
- prolapsus
- opular
- operandi
- antizymotic
- diptheria
- unperverted
- modus
- catarrhs
- caseine
- chamberland
- fitzhugh
- esquimaux
- volatilizer
- congestions
- unstimulating
- tyrotoxicon
- cottolene
- granulations
- depuration
- unphysiological
- cocain
- granuto
- peptogen
- desponding
- fermentations
- degenerations
- mesenteric
- columbias
- herbivora
- indigestibles
- menticulture
- orificialist
- respirations
- ladies'
- maltol
- sulphureted
- palliser
- chlorid
- druggery
- wyckoff
- granut
- fibrine
- granuts
- lacteals
- apepsia
- peptogens
- amylodextrin
- gymnasia
- nicotin
- marchand's
- chlorin
- insalivation
- spiralis
- dextrine
- drugopathy
- unhealthfulness
- germless
- libitum
- farinacea
- medicatrix
- gayety
- alleghanies
- fletcherizing
- peptonized
- sphygmographic
- enteroptosis
- phthisical
- nuttola
- dietetically
- neuralgias
- muddlement
- hydrotherapeutic
- bromfield
- maltine
- unapproached
- unskimmed
- excitants
- unaired
- miasm
- quincey
- unphilosophical
- bionomy
- paraffine
- cerealine
- brahmin
- costiveness
- empirics
- decollete
- antidoting
- cuticura
- extractives
- chromophilic
- chemico
- hindostan
- chymification
- wheatose
- physiologies
- bichlorid
- gormands
- nebulizing
- emunctories
- subcarbonate
- micrococci
- trichinous
- antidoted
- crawfish
- cundurango
- dryest
- celsus
- unirritating
- sulphuretted
- keichline
- overtasked
- boyesen
- faradization
- tumblerful
- grandeurs
- pectoris
- tracted
- edinburg
- coloclyster
- medicale
- fetich
- nitrogenized
- chilian
- l'hommedieu
- hydatids
- lippincott's
- gesenius
- sanitaria
- vaccinators
- expectorations
- trichiniasis
- oolong
- wickless
- measle
- priessnitz
- natitorium
- similia
- hayem
- oblongata
- comedones
- palatableness
- ethylic
- ozonifying
- hydropathists
- shachar
- desquamation
- quinia
- bovinine
- insalivated
- infusoria
- habitue
- conventionalities
- incumbrance
- resinol
- rheotome
- indica
- navaho
- laplanders
- scabiei
- allopaths
- chlorodyne
- potentized
- gayly
- flourens
- fuchsine
- castoria
- billiousness
- rheumatisms
- putrefactions
- leggett
- macrophags
- exudations
- incloses
- trowsers
- patagonians
- theobromine
- chirurgical
- manioca
- fuego
- aquae
- thermophore
- dissipations
- chlorophyl
- uncared
- masticators
- sacchari
- potencies
- moros
- celestials
- hamamelis
- atheromatous
- essenes
- amination
- chlorotic
- similibus
- drugopathic
- dropsies
- promethea
- erysipelatous
- adhesions
- dyspepsy
- borated
- souffle
- gregor
- invigorator
- ischomachus
- kiniesitherapy
- repapered
- analagous
- cholesterine
- rheumatiz
- zealanders
- toilettes
- aneurism
- wyandottes
- aphthous
- hydrocyanic
- corea
- sicklied
- dioxid
- brainard
- miasms
- twistings
- violacea
- opsonic
- epigastric
- curantur
- foolometer
- mammalia
- twitchings
- medicus
- excrements
- butterine
- dextrous
- simples
- porridges
- priori
- cholagogues
- homeopathists
- undepraved
- pancrobilin
- pneumonias
- centenarianism
- leucorrhcea
- pruritis
- danielites
- sanctities
- inclemencies
- winterless
- quadrumana
- gothard
- demodex
- hygeiana
- exophthalmic
- ganglions
- cachectic
- glyeozone
- minnehaha
- leucorrhea
- culturist
- adolphus
- trichinatous
- repinings
- mementoes
- pietra
- pythagoreans
- aerotherapy
- sensualities
- unamiable
- diagnosticate
- unhealthfully
- magnus
- physica
- habitus
- empyreumatic
- mercurials
- rheuma
- alboline
- substantials
- diptheretic
- cancellous
- superioris
- unswept
- grahamism
- sacculated
- inspissated
- sesamoid
- frugivora
- dextrinization
- insipidus
- devastations
- lanoline
- formaldehyd
- fashionables
- crystalized
- uninflammable
- contaminations
- electrotypers
- borealis
- pulseless
- healthward
- sulphuret
- undescribed
- orificialists
- venoms
- orangoutang
- unremitted
- megatherium
- dietetical
- peptogenic
- entrees
- croustades
- servetus
- chromolithographic
- reveillon
- ginghams
- diastatic
- animalcula
- infinitum
- lazzaroni
- superinduced
- englander
- antitoxine
- monoxid
- etherial
- diarrhceal
- compounders
- tractions
- chautauquas
- cholagogue
- incumbrances
- uncleansed
- matriculants
- orbicularis
- combatted
- unsupplied
- agues
- plethysmograph
- undrained
- menthe
- clamorings
- hyginnicks
- albertus
- invigorators
- mishna
- pleasanton
- tricocephalus
- vermillion
- adornings
- bleedings
- scientifique
- lithograpic
- ralstonites
- leucomains
- orrville
- abercrombie
- invigorant
- dyspepsias
- asthmas
- thompsonianism
- disassimilation
- tuberculated
- dysmenorrhea
- physicial
- rarified
- toxine
- lacto
- kalsomines
- electrotyping
- musonius
- carbolized
- animalculum
- tabacum
- paniers
- origanum
- hospitalities
- piggified
- frijoles
- stillingia
- digestants
- oculum
- motograph
- pomological
- sanscrit
- allopathically
- tuberculation
- hyposulphite
- asculapius
- elegancies
- unmasticated
- peptonoids
- paralyzer
- alteratives
- calcis
- unheeding
- electrolibration
- parenchymatous
- alchohol
- antineuralgic
- tyrosin
- supercede
- membered
- hirschfeld
- mellitus
- gymnotus
- automata
- expirations
- percutient
- triturations
- diarrheas
- ecstacies
- stereopticons
- laundrying
- sozodont
- reprobating
- englanders
- oxyuris
- dispepsia
- francais
- epidemical
- mothers'
- bemuddled
- punctureless
- winnifred
- companionships
- caffeism
- grahamites
- finlanders
- hyperchlorhydria
- diseasing
- miseducation
- publishers'
- sunburnt
- omnivora
- thriftlessness
- peptonizing
- toxicon
- percussions
- lettres
- vermicularis
- poisonopathy
- pyrosoma
- culturists
- deformers
- peaslee
- quackeries
- adulterators
- scollard
- soiree
- vegetus
- crustacea
- bromidia
- esthetical
- lactis
- fatuus
- extinguishment
- undersuits
- pyrogallic
- autoinfectious
- vitrol
- oxygenized
- diarrhcea
- napthol
- jessop
- nicotinized
- overfullness
- typhlitis
- cryptogamic
- unshriveled
- semmola
- incognita
- disinfectors
- sanitarianism
- overdistention
- ipecacuanha
- sulphonal
- somerville
- melograph
- pantalettes
- americanitis
- silicious
- electuaries
- dispositioned
- chantemesse
- paralta
- osmazome
- bollinger
- endometritis
- konigsberg
- vegetarische
- armamentarium
- spitzbergen
- concensus
- unhygienically
- bissell
- polyscope
- phillipsburg
- langford
- danforth
- colics

# IR (Indiana Reporter)

## General Information

- Date range: 1901-
- Publication cycle: bi-weekly
- Layout notes: 3 columns, shifting to 2 in 1902. 
- Publisher: Indiana Association of Seventh-day Adventists
- Topics: camp meetings, local organizations.

The collection held and digitized by the SDA begins with Volume 7.

In [6]:
query_df(results_to_df('IR'), 4, 2, "count")

File: 2016-12-13-Spelling-Errors-IR.txt
IR has 23349 rows


Unnamed: 0,spell_error,count,word_length
2109,tion,359,4
23146,mahan,315,5
8358,ence,177,4
5308,presidentw,175,10
22751,walkerton,167,9
19537,tions,144,5
6118,unionville,143,10
20371,rocklane,140,8
18307,ference,139,7
3,chas,134,4


## Report

This title seems to suffer from separated line endings. It appears that the hyphen on these is captured in the OCR but the words were not rejoined by the regex. Also, the attempts to split squashed names appear to be causing problems with complex names, such as McMahan. "missionaryr" appears to be an error created when removing the hyphen.

Title is also prone to regular spelling errors.

### Potential title-specific corrections:

- thirdstreetindianapolisindiana (13) -> third street indianapolis indiana
- presidentmorris (13) -> president morris
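Run-together tokens like these could in principle be split with a dictionary-driven segmentation instead of one-off replacements. The following is a minimal sketch under that assumption; `segment` and the small `vocabulary` are hypothetical, and the function prefers the split with the fewest pieces.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split `text` into vocabulary words, preferring the fewest pieces.

    Returns a list of words, or None when no complete split exists.
    """
    # best[i] holds the shortest known segmentation of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


vocabulary = {"third", "street", "indianapolis", "indiana", "president", "morris"}
print(segment("thirdstreetindianapolisindiana", vocabulary))
print(segment("presidentmorris", vocabulary))
```

Because the function returns `None` when no full split exists, names that are simply missing from the word lists are left for manual review rather than being broken apart.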


### Words to Add

- campmeetings
- fredricksburg
- lifegiver
- bookmen

# LB (The Life Boat)

## General Information

- Date range: 1898-1920
- Publication cycle: monthly
- Layout notes: 2-column
- Publisher: Medical Missionary and Benevolent Association
- Topics: health, medical missions

In [6]:
query_df(results_to_df('LB'), 4, 2, "count")

File: 2016-12-30-Spelling-Errors-LB.txt
LB has 25229 rows
File: 2016-12-31-Spelling-Errors-LB.txt
LB has 25227 rows


Unnamed: 0,spell_error,count,word_length
10799,mackey,296,6
14763,hsi,247,3
23192,halsted,165,7
7895,ile,142,3
13506,vitamines,110,9
7689,lundell,97,7
23332,kershaw,95,7
21299,auley,92,5
14910,pearsons,91,8
24764,harner,90,6


## Report

- dashes causing false errors
- borders composed of 'ooooo'
- OCR struggling to recognize Latin ligatures -> "pharmacopceia" should be "pharmacopœia"
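For the ligature problem, one option is to try swapping the misread letters back in and keep the result only if it is a known word. This is a sketch under the assumption that "œ" is consistently read as "ce" and that the word list contains the ligature spelling; `repair_ligature` is a hypothetical helper.

```python
def repair_ligature(token, vocabulary, misread="ce", ligature="œ"):
    """Try replacing each occurrence of `misread` with `ligature`, one at a time."""
    if token in vocabulary:
        return token
    start = token.find(misread)
    while start != -1:
        candidate = token[:start] + ligature + token[start + len(misread):]
        if candidate in vocabulary:
            return candidate
        start = token.find(misread, start + 1)
    # No replacement produced a known word, so leave the token as it was.
    return token


vocabulary = {"pharmacopœia"}
print(repair_ligature("pharmacopceia", vocabulary))
```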

### Words to Add

- mackey
- vitamines
- cassimeres
- prayerless
- keswick
- fletcherize
- gastro
- demineralized
- unconfessed
- dietitics
- degerminated
- fletcherized
- minnetonka

# LH (Life and Health)

## General Information

- Date range: 1904-1920
- Publication cycle: monthly
- Layout notes: 2-column
- Publisher: Review and Herald Publishing Association
- Topics: health, home

Notes: 

- Switching to examining only those errors that occur 10 times or more.
- Digitized collection begins with volume 19.

In [8]:
query_df(results_to_df('LH'), 9, 2, "count")

File: 2016-12-31-Spelling-Errors-LH.txt
LH has 27244 rows


Unnamed: 0,spell_error,count,word_length
3136,cornforth,267,9
15571,tri,120,3
25076,tion,119,4
7320,nauheim,91,7
4410,antituberculosis,87,16
1492,pre,83,3
8341,'ad,71,3
18562,vitamine,64,8
17815,onehalf,63,7
24188,socalled,62,8


## Report

### Words to Add

- cornforth
- antituberculosis
- vitamine
- quinin
- sanatoria
- drugless
- antityphoid
- nebulizers
- unvaccinated
- frictionary
- peroxid
- pellagrins
- welch's
- canners
- misbranded
- bulgaricus
- oxygenator
- iodin
- misbranding
- vender
- pectose
- intoxications
- antipyrin
- autotoxemia
- hypophosphites
- antiliquor
- lavinder
- liquozone
- antivaccination
- veronal
- phthisiophobia
- chenopodium
- susceptibles
- oxygenor
- uninspected
- starchless
- revaccination
- diaduction
- douch


# LibM (Liberty)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics: 

In [9]:
query_df(results_to_df('LibM'), 9, 2, "count")

File: 2016-12-31-Spelling-Errors-LibM.txt
LibM has 9203 rows


Unnamed: 0,spell_error,count,word_length
412,gallivan,61,8
3949,religio,48,7
1554,miraglia,45,8
6330,tion,43,4
8822,cxsar,40,5
7506,neander,38,7
5181,charta,37,6
2382,ment,32,4
6487,chas,30,4
257,seventhday,29,10


## Report

Publication appears to use a lot of proper names. Will have a higher reported error rate because of that.

### Words to Add

- gallivan
- miraglia
- brevities

# LUH (Lake Union Herald)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [10]:
query_df(results_to_df('LUH'), 9, 2, "count")

File: 2016-12-31-Spelling-Errors-LUH.txt
LUH has 26795 rows


Unnamed: 0,spell_error,count,word_length
7270,vagh,663,4
19685,ords,471,4
11408,drury,455,5
18121,chas,443,4
10254,suda,353,4
19610,shelbyville,284,11
26642,herrin,275,6
6736,conaughey,271,9
22351,kimberlin,266,9
10191,plake,241,5


## Report

Mc\* names have been split.
There are a lot of names in this publication.


### Words to Add

- mcvagh
- zacharius
- eldership


# NMN (North Michigan News)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [11]:
query_df(results_to_df('NMN'), 9, 2, "count")

File: 2017-01-01-Spelling-Errors-NMN.txt
NMN has 9663 rows


Unnamed: 0,spell_error,count,word_length
2957,aro,89,3
9302,leetsville,28,10
5842,willaman,26,8
176,dighton,22,7
3643,evart,21,5
2724,soo,20,3
6990,clellan,19,7
4662,altho,18,5
4149,myrta,18,5
8423,beeler,15,6


## Report

Recurring problem of the OCR reading "e"s as "o"s (e.g., "aro", likely for "are").

### Words to Add


# PHJ (Pacific Health Journal and Temperance Advocate)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [12]:
query_df(results_to_df('PHJ'), 9, 2, "count")

File: 2017-01-01-Spelling-Errors-PHJ.txt
PHJ has 20679 rows


Unnamed: 0,spell_error,count,word_length
697,sel,255,3
19978,ournal,129,6
278,societyl,80,8
16362,munn,73,4
4346,allerton,58,8
15963,misses',56,7
17450,tion,54,4
3863,urnal,51,5
17316,teviperance,50,11
16736,cloe,47,4


## Report

"teviperance" is an error from the page title that has been partially corrected. The original OCR is "TE1VIPERANCE", and the original is "TEMPERANCE".

"sel" is an abbreviation for attributing poems and other miscellanea.

### Words to Add

- misses'
- sitz
- butyric


# PTAR (The Present Truth and Adventist Review)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [4]:
query_df(results_to_df('PTAR'), 4, 2, "count")

File: 2017-01-03-Spelling-Errors-PTAR.txt
PTAR has 5805 rows


Unnamed: 0,spell_error,count,word_length
71,ver,78,3
2050,'the,49,4
3524,ment,46,4
2269,holies,39,6
3671,tion,34,4
5131,'of,23,3
2944,storrs,23,6
5342,eze,17,3
2894,ments,17,5
5260,'to,13,3


## Report

For this title, I used a minimum count of 5 (`count > 4`) as the cut-off because the error list is so short.

Appears to have line-ending problems.
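
A minimal sketch of one way to rejoin words split at line endings (the "ment" and "tion" fragments in the table above). It assumes the text is already tokenized and that a word list is available as a Python set. `rejoin_split_words` and the sample word set are hypothetical, not the cleaning code actually used.

```python
def rejoin_split_words(tokens, known_words):
    """Join a token with an unknown following fragment when the result is a known word."""
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            # Only join when the second token is an orphan (not a word itself)
            # and the concatenation is a known word.
            if tokens[i + 1] not in known_words and joined in known_words:
                result.append(joined)
                i += 2
                continue
        result.append(tokens[i])
        i += 1
    return result


known = {"the", "of", "govern", "government", "foundation"}  # stand-in word list
print(rejoin_split_words(["the", "founda", "tion", "of", "govern", "ment"], known))
# → ['the', 'foundation', 'of', 'government']
```

Requiring the second token to be unknown keeps legitimate word pairs apart, at the cost of missing splits whose second half happens to be a word on its own.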


### Words to Add

[none]

# PUR (Pacific Union Recorder)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:


In [6]:
query_df(results_to_df('PUR'), 9, 2, "count")

File: 2017-01-03-Spelling-Errors-PUR.txt
PUR has 54012 rows


Unnamed: 0,spell_error,count,word_length
33982,tion,627,4
180,elhany,490,6
46410,seventhday,448,10
53016,ords,407,4
11275,ence,380,4
6552,ment,308,4
48811,chas,304,4
39826,sabbathschool,297,13
31687,pherson,287,7
45912,ference,281,7


## Report

- line ending problems again.

In an attempt to address the place-name problem more systematically, I generated a new word list from USGS data on US cities. This greatly reduced the number of identified "errors" in this title.
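
The word-list step might look roughly like the sketch below. The pipe delimiter, the `FEATURE_NAME` column, and the in-memory sample are assumptions made for illustration. The layout of the actual USGS file and the code used to process it are not recorded in this notebook.

```python
import csv
import io
import re


def place_name_words(lines, name_column="FEATURE_NAME", delimiter="|"):
    """Collect lowercase word tokens from the place-name column of a delimited file."""
    words = set()
    for row in csv.DictReader(lines, delimiter=delimiter):
        # Split multi-word names ("Battle Creek") into separate tokens.
        for token in re.findall(r"[a-z]+", row[name_column].lower()):
            if len(token) > 2:
                words.add(token)
    return words


# Tiny in-memory sample standing in for the downloaded file (hypothetical layout).
sample = io.StringIO(
    "FEATURE_NAME|STATE_ALPHA\n"
    "Shelbyville|IN\n"
    "Battle Creek|MI\n"
    "Leetsville|MI\n"
)
print(sorted(place_name_words(sample)))
# → ['battle', 'creek', 'leetsville', 'shelbyville']
```

Tokens of one or two letters are dropped here to mirror the `word_length > 2` filter used in the reports.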

### Words to Add

- sda
- depreciations
- colaborers
- incorporators
- tardinesses


# RH (Review and Herald)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [13]:
query_df(results_to_df('RH'), 24, 2, "count")

File: 2017-01-03-Spelling-Errors-RH.txt
RH has 697431 rows


Unnamed: 0,spell_error,count,word_length
228574,tion,5691,4
528644,'the,5093,4
64390,brn,3962,3
359021,ment,2885,4
385865,pre,2872,3
128283,seventhday,2847,10
601139,chas,2837,4
78614,'of,2796,3
309121,ets,2249,3
291436,eze,2209,3


## Report

- line-endings splitting words again.
- recurring Roman numeral error where an "l" is attached to the end of a Roman numeral. This may be due to the way I removed punctuation.
- many extra "'"s -- will need a way to remove those
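
For the stray apostrophes in the last item, one option is sketched here. It assumes tokens such as "'the" and "'of" are ordinary words with a spurious leading quote mark, and that contractions like "'em" should be left alone. The function name and the `keep` list are hypothetical.

```python
def strip_stray_apostrophe(token, known_words, keep=("'em", "'tis", "'twas")):
    """Drop a leading apostrophe when what remains is a known word."""
    if token in keep:
        return token
    if token.startswith("'") and token[1:] in known_words:
        return token[1:]
    return token


known = {"the", "of", "to"}  # stand-in for the real word list
print([strip_stray_apostrophe(t, known) for t in ["'the", "'of", "'em", "ev'ry"]])
# → ['the', 'of', "'em", "ev'ry"]
```

Checking the remainder against the word list avoids touching tokens where the apostrophe is part of the word.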

### Words to Add

- medo
- holies
- fillio
- religio
- reichstag
- disfellowshiped
- besetments
- exode
- harken
- herold
- besetment
- antinomianism
- hinderance
- millerism
- espirito
- czechowski
- distributers
- sinaitic
- haskells
- rulership
- klux
- wesleys
- campbellites
- zinzendorf
- shelterless
- dominicum
- disfellowship
- unrepented
- raptured
- unregenerated
- parousia
- sabbatizing
- arimathea
- reenforcements
- interpositions
- judaizers
- hystaspes
- boastingly
- accidently
- prayerfulness
- rejecters
- ruthenians
- unimpressible
- doubtings
- cumberers
- unfoldings
- bickerings
- protectory
- temporalities
- immersionists
- unchangeableness
- sinlessness
- usa
- spiritu
- mysticisms

# Sligo (Sligonian)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:


In [14]:
query_df(results_to_df('Sligo'), 4, 2, "count")

File: 2017-01-04-Spelling-Errors-Sligo.txt
Sligo has 2425 rows


Unnamed: 0,spell_error,count,word_length
249,sligon,36,6
1214,schwab,30,6
1300,mattingly,22,9
2318,kuppenheimer,20,12
1058,kamoda,15,6
1488,lippart,14,7
1388,dietel,14,6
586,styleplus,14,9
1938,herzog,14,6
1156,rebok,13,5


## Report

### Words to Add

- hallowe'en
- classmen


# SOL (The Sentinel of Liberty)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [15]:
query_df(results_to_df('SOL'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-SOL.txt
SOL has 7887 rows


Unnamed: 0,spell_error,count,word_length
7320,bsl,79,3
2290,agl,51,3
6457,mutchler,44,8
6427,sabbatteans,38,11
3200,loth,38,4
262,tion,37,4
4502,'the,33,4
4139,farmakis,29,8
7683,ment,28,4
3089,saloonmen,27,9


## Report

### Words to Add

- sabbatteans

# ST (Signs of the Times)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:


In [16]:
query_df(results_to_df('ST'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-ST.txt
ST has 218759 rows


Unnamed: 0,spell_error,count,word_length
143577,tion,2185,4
113227,'the,1527,4
183046,eze,1301,3
188020,altho,1275,5
6577,ment,1184,4
23300,pre,791,3
194004,ets,791,3
49006,'of,782,3
3487,sel,778,3
17579,tions,614,5


## Report

- line ending problems
- thesignsofthetimes -> title error occurring 24 times.

### Words to Add

- sabbaton
- overcomer
- burgundians
- spiritistic
- honorius
- nabonadius
- ekklesia
- valentinian
- bannerman
- arians
- dragonic
- hussites
- spiritists
- rappings
- bartimeus
- unroofed
- distributer
- fulfilments
- cesarea
- quartette
- restraineth
- gipsy
- vicarius
- itinerating
- horatius
- demoniacs
- quitted
- unrebuked
- sermonets
- placidia
- embezzlements
- unrepentable
- unstinted
- helpmeet
- pedobaptists
- sabbatize
- saintship
- wailings
- chaldaic
- lacunza
- substitutionary
- spiritist
- overturnings
- epiphanius
- plentitude
- zionistic
- naturedly
- colaborer
- idolator
- soulism
- quartettes
- creatorship
- almsgiving
- ignatian
- epiphaneia
- antigonus
- zionward
- alexandrinus
- licinius
- ecstacy
- legalists
- reprobated
- sabbatum
- bolshevists
- mummeries
- waldensians
- metaphrastes
- extortions
- harkening
- repellant
- unpardoned
- taborites
- embryotic
- defilements
- baalism


# SUW (Southern Union Worker | Southern Tidings)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [17]:
query_df(results_to_df('SUW'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-SUW.txt
SUW has 35161 rows


Unnamed: 0,spell_error,count,word_length
33462,bfl,912,3
2585,agts,838,4
30115,chas,433,4
33942,ords,415,4
10650,bracy,289,5
15563,vagh,282,4
26188,wks,264,3
14101,billups,241,7
17041,chastain,238,8
22054,lennan,233,6


## Report

- lots of personal names
- line endings?


### Words to Add

- dolorosa
- odorized
- orpheum

# TCOG (The Church Officers Gazette)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [18]:
query_df(results_to_df('TCOG'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-TCOG.txt
TCOG has 10136 rows


Unnamed: 0,spell_error,count,word_length
5831,'the,106,4
8140,eze,59,3
2897,mal,55,3
2492,'of,51,3
6006,mayta,45,5
4803,tbe,45,3
6951,scudder,40,7
2630,'and,39,4
8792,'to,38,3
7083,agtte,36,5


## Report

### Words to Add

[none]

# TMM (The Missionary Magazine)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [20]:
query_df(results_to_df('TMM'), 4, 2, "count")

File: 2017-01-04-Spelling-Errors-TMM.txt
TMM has 6256 rows


Unnamed: 0,spell_error,count,word_length
4361,raratonga,43,9
3409,buluwayo,37,8
4101,carthy,20,6
1813,karmatar,20,8
1260,stauffer,20,8
1743,kalaka,20,6
2049,hausaland,19,9
2626,okohira,18,7
6066,hasegawa,18,8
5005,couva,17,5


## Report

- place name heavy


### Words to Add

- raratonga
- buluwayo
- karmatar
- tongatabu
- bootooba


# WMH (West Michigan Herald)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [21]:
query_df(results_to_df('WMH'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-WMH.txt
WMH has 6802 rows


Unnamed: 0,spell_error,count,word_length
6161,sabbathschool,170,13
2631,presidenta,75,10
3801,treasurere,61,10
5854,secretarym,61,10
581,numbersin,53,9
5936,numbessin,52,9
1071,horr,39,4
3898,'the,36,4
195,blendon,32,7
5002,wyla,32,4


## Report

- looks like there is a lot of noise in the PDFs -- many words have an extra character attached (for example, "presidenta", "treasurere", "secretarym").
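
A small sketch for flagging tokens that look like a known word plus one extra trailing character. It assumes the noise always lands at the end of the word, which fits "presidenta" but would not catch other patterns such as "numbessin". `strip_extra_character` and the sample word set are hypothetical.

```python
def strip_extra_character(token, known_words):
    """Return the known word obtained by dropping one trailing character, else None."""
    if token not in known_words and token[:-1] in known_words:
        return token[:-1]
    return None


known = {"president", "treasurer", "secretary"}  # stand-in for the real word list
for token in ["presidenta", "treasurere", "secretarym", "numbersin"]:
    print(token, "->", strip_extra_character(token, known))
```

Anything this returns should be treated as a candidate for review rather than applied automatically, since a real word plus one letter can also be another real word missing from the list.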

### Words to Add

- benefitted

# YI (The Youth's Instructor)

## General Information

- Date range: 
- Publication cycle: 
- Layout notes: 
- Publisher: 
- Topics:

In [22]:
query_df(results_to_df('YI'), 9, 2, "count")

File: 2017-01-04-Spelling-Errors-YI.txt
YI has 98091 rows


Unnamed: 0,spell_error,count,word_length
85050,sabbathschool,607,13
3429,'the,408,4
65800,'em,399,3
65897,eze,316,3
45605,xil,315,3
43104,ver,302,3
49812,sel,254,3
30373,tion,227,4
31970,mal,214,3
44178,'of,211,3


## Report

### Words to Add

- zambesi
- africaner
- birdlings
- centauri
- ev'ry
- birdling
- herrnhut
- medicator
- sulphite
- multnomah
- dusseldorf
- flowerets
- pepita
- caressingly
- lefevre
- kalakaua
- limbed
- medicators
- montanists
- dreadnaughts
- thornless
- loadstone
- matinee
- dishwashing
- philadelphus
- gracias
- fronded
- twilights
- juanito


In [24]:
# %load shared_elements/system_info.py
import IPython
print (IPython.sys_info())
!pip freeze

{'commit_hash': '5c9c918',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/Users/jeriwieringa/miniconda3/envs/dissertation2/lib/python3.5/site-packages/IPython',
 'ipython_version': '5.1.0',
 'os_name': 'posix',
 'platform': 'Darwin-16.3.0-x86_64-i386-64bit',
 'sys_executable': '/Users/jeriwieringa/miniconda3/envs/dissertation2/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, '
                '17:52:12) \n'
                '[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]'}
anaconda-client==1.5.5
appnope==0.1.0
argh==0.26.1
blinker==1.4
bokeh==0.12.3
boto==2.43.0
bz2file==0.98
chest==0.2.3
cloudpickle==0.2.1
clyent==1.2.2
dask==0.12.0
datashader==0.4.0
datashape==0.5.2
decorator==4.0.10
docutils==0.12
doit==0.29.0
gensim==0.12.4
Ghost.py==0.2.3
ghp-import2==1.0.1
gspread==0.4.1
HeapDict==1.0.0
httplib2==0.9.2
husl==4.0.3
ipykernel==4.5.2
ipython==5.1.0
ipython-genutils==0.1.0
ipyw