# Text cleaning - TCP

The goal of this notebook is to show you how you can use Python to semi-automatically clean Early English Books Online - Text Creation Partnership `txt` files. In a nutshell, the code walks you through common revisions required of many historical texts you'll find in the wild: joining hyphenated words together, replacing common OCR and typing errors, modernizing vocabulary, and so on. At the end of the notebook, you save the resulting output as a new `txt` file, with a few other files logging the changes you've made, and questionable "words" that you might consider looking at more closely. You can alter the code to meet your particular genre's conventions. Note that if you are dealing with a late 17th-18th century text that you have OCRed yourself, there will likely be other corrections to make; this code will hopefully provide a template for additional revisions. 

This notebook is intended to provide a template for people in the humanities who know a little bit of Python, but who are unable to find tutorials that focus on the very first, and very necessary, step in digital history - getting your *own* text documents into a form that can then be analyzed. Most online tutorials, targeted at business and (social/data) science audiences, start at the `preprocessing` stage of tokenizing and lemmatizing, but this assumes cleaner input than what is available for most historians. The code below provides one way to massage your text into a clean(er) format that can then be analyzed with all the tools covered in other tutorials.

Once you have a "clean" version of the text file after this notebook, another notebook should make a number of (optional) transformations to the text. Things like converting number-words to digits, combine together compound words as single terms (e.g. `General_Churchill` instead of `General` and `Churchill` as separate tokens), and converting the period's frequent, and often random, Capitalized Nouns (and Other Parts of Speech) into lowercase nouns, to make named entity recognition and part-of-speech tagging easier. But that's for another notebook.

After that second stage, you can then do the normal preprocessing steps discussed in most NLP tutorials: lemmatizing, finding frequent ngrams, POS tagging, and so forth. This first notebook is focused on creating a `txt` file of the *original* document (a big string, in other words), but with corrections and a few other modifications.

## For intermediate beginners in Python

If you are new to Python, you should look elsewhere for the basics of Python syntax, and of computer programming in general. There are a million (free) resources online, on blogs, and on YouTube. But if you already know those basics, this code provides a variety of techniques that you can use to clean up your own text.

In other words, there are several prerequisites in order to use this code: installing Python 3+ and Jupyter Notebooks, for a start (I'd recommend Anaconda) - Google Colaboratory (https://colab.research.google.com/notebooks/welcome.ipynb) is a new intriguing option that, theoretically, allows you to simply import this notebook into it and off you go! If you simply want to substitute your own file for my sample text, you can just swap out filenames. But if you want to modify or extend the code, some background knowledge is required, namely a familiarity with Python objects (strings, lists and dictionaries particularly) and some of their most common methods. An understanding of regular expressions (aka regex) is also key for humanists who want to manipulate text.

But this notebook does try to describe and explain the programming choices that I have made: why I chose to use a string `replace` method instead of a regex substitution, or why I did this revision before another one... I also try to show alternative options, in case your use case differs from mine. All these little decisions can make a big difference in the end results, and they aren't necessarily obvious to the beginner. They are also what is most lacking in most online tutorials, which only show you a single way to accomplish your (actually, *their*) task.

That said, there are a few important general lessons for how this code deals with text:
 1. Read in a substitution lexicon (as a dict) to subsitute one string for another.
 2. Use Python's str `replace` method to make substitutions.
 3. Use the regex `re.sub` method.
 
 Some might be easier than others for any given use case.

You should also be aware that there are many different characters with similar appearances, e.g. é vs. è vs. e vs. ê vs. ë... If you're doing crosscultural research that refers to proper nouns from other cultures, you'll need to deal with this at several points. Character encoding is a pain, but life would be so dull if we only had the 128 characters of the traditional English character set to play with. Which means, by the way, you should be using a text editor, and not MS Word.

## Workflow

This code is written as a prototype, to work on a single file at a time, and it is conservative in that it allows you to look at each suggested change before you commit to it. 

The first thing you should do is set the file paths to your system, specify the file you want to run the code on, and identify whether this is the first time running on a specific file, or whether you're rerunning the code (see below for explanations).

Then, if you're the trustworthy type or just want to see what the code does, you can just run all the cells in one fell swoop (in Jupyter notebook: `Kernel-Restart & Run All`) and then check the revised text and the change log for a list of the changes.

Assuming you want to explore the possible changes first, each type of edit has several steps:
1. Figure out the types of errors you need to fix. The code includes a number of standard types of errors you're likely to find in 17C-18C English texts. You can copy or edit the code to add your own corrections.
2. For each error type, there are three basic types of corrections to choose from, each in a separate function. You either correct the error by specifying the string to replace, or you specify the error with a regular expression, or you read in a substitution list that will substitute column A (the error) for column B (the correction).
3. The first bit of code in each section will indicate how many of that type of error exist in the document. In a few cases, the errors aren't consistent enough to fix programmatically. These types of errors are saved to separate files that you can look through, make the proper corrections, and then read back into the (end of the) notebook to fix those as well.
4. If there are no errors of type X, the code will create an empty dict for audit purposes, letting you know that it looked for error type X but there weren't any.
5. If error type X *does* exist in your text, it will show you the errors and run the cell that makes the corrections. Most of the edits are made either with substitution lists (reading a two-column `csv` file in as a dictionary), or with a simple string `replace` method, or using the regular expression `re.sub` method. The edit code also writes the (unique) changes to a separate dictionary (i.e. change log for an audit trail), and prints the keys and values so you can skim through the changes. If mistakes were made, tweak the code and rerun that cell. If the results aren't what you expected, rerun the entire notebook - the order in which you run cells makes a big difference in your results.
6. Once you're happy with the edit results, the code cumulatively saves the edits from that section to the text, as a distinct text object.
7. Move on to the next type of error and repeat steps 2-6.
8. If you need to roll back the edits of the section you are working on, say you need to tweak your regex, just reassign the `text1` variable at the end of the previous section, like `text = text1`. (But don't forget to comment this reassignment out when done.) Or, just rerun the whole notebook again.
9. After all the errors have been checked, it saves the output as a `txt` file, the change log as a `csv` file. Look though the change log for any problems - particularly pay attention to the `correct` dict, especially if substring were accidentally substituted, e.g. `particularl` should *not* be changed to `particulars`, for fear of converting `particularly` to `particularsy`.
10. Those errors that require individual examination are written to separate files in the `output` folder. You can open each of those potential errors files, look through them, and add the corrections. Then, you can go back to the notebook, change the `Set flag to read in manual lexica`'s `newdoc` value to `N`, then rerun the code to also read in the edited potential errors file and run the substitution function on them.

Depending on the size of your text document and your machine, the code might take a few minutes to run.

For newbies, remember that your changes are only applied to the text *in memory*, until you explicitly choose to save them to a file. So feel free to iterate through the code, experimenting to see what works best for you - you don't need to worry about overwriting the original `txt` file unless you overwrite that file in the save stage. But if you ever get lost jumping back and forth between sections, just reset by using the `Kernel-Restart & Run` command. If the notebook becomes unresponsive (i.e. a cell won't stop running), use `Interrupt`.

## Supplemental materials

I've included about a dozen substitution lexica in the GitHub repo that the code uses to check for errors. Many of these lexica are based off of Ted Underwood's `DataMunging` repository on GitHub:  https://github.com/tedunderwood/DataMunging/tree/master/rulesets. You should modify your copies or use your own, depending on your own corpus. If you have systematic changes to make, remember that you can use some simple Python code to modify these lexica - read them in as a list or a dict and then make whatever changes with regex before writing them back out to a new `csv`/`txt` file.

## Miscellaneous details

The Visualizing English Print project has already done significant cleaning of the EEBO TCP files. They are downloadable from http://graphics.cs.wisc.edu/WP/vep/vep-tcp-collection/; also on GitHub.  But note that some of the corrections below would still need to be done with VEP texts, and you might well be using this code to clean non-VEP texts, so this notebook is based on the original TCP files.

For similar tutorials focused on slightly more structured documents, see:
1. https://programminghistorian.org/en/lessons/extracting-keywords#build-your-gazetteer
2. https://programminghistorian.org/en/lessons/generating-an-ordered-data-set-from-an-OCR-text-file
3. https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
4. https://www.meredithpaker.com/updates/regexcleaning
5. https://machinelearningmastery.com/clean-text-machine-learning-python/

The Programming Historian is always a good place to start, though you really should be using Python 3+ and not 2.7.

## Work in progress

I've run this code over half-a-dozen different TCP documents, and each one seems to find additional (and sometimes unique) errors to correct. Even transcribed documents will have errors in them, and some of these errors will not be predictable in advance. You will find, for example, different types of character errors in Rohan's *Compleat Captain*, compared to D'Auvergne's *The history of the campagne in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy*.

So I've chosen to simply add specific corrections for each new documents I run the notebook over, and let the code grow with each new document, rather than try to pare out irrelevant corrections for each document. The `changesdict` audit trail will simply indicate each error the code checked for, and whether there were any changes to it or not. And, frankly, you'll never really know all the types of errors you'll encounter until you look for them.

If you use this code with your own documents, pay particular attention to:
1. The regular expressions. They can be... challenging, and there are usually many corner and edge cases that are hard to imagine until you discover your text has one. Poorly-formed regular expressions will "fail silently", i.e. they'll give you what you asked for, not necessarily what you want, and they'll never kick back an error as an alert. This is why you need to peruse the change log when you're done.
2. The lexica of corrections/substitutions. Pay particular attention to the possibility of replacing substrings within an otherwise correct word - you want to balance the desire to find every character error with the desire to find enough context around each error so you can figure out what the erroneous character relly should be. E.g. if you want to replace `Thro` with `Through`, you'll also get `Throughone` instead of `Throne` if you're not careful. So consider padding your strings with spaces, though recognize that, depending on your regex, you might miss some errors that are are the beginning of a line or which are followed immediately by punctuation. For example, `\S* \S*\*\S* \S*` will find `hello t*here friend`, but won't find `hello t*here.`

In other words, be sure to look through the `changesdict` audit trail whenever you make changes.

## Future improvements

Since I am still a relative new-comer to Python, and it's my first programming language, this is not expert, or Pythonic, code. But it gets the job done, even if it often adopts a brute-force, repetitive strategy. But that's how you learn.


Future improvements would be to further refactor this code to create additional functions for each of the different types of edits (str replace, regex, lexicon as dict), and make it run (automatically) against an entire directory of files. I'm sure the regex could be improved. There are probably also more elegant ways to achieve the goal of cleaning the text - this code allows you to oversee every step of the editing process. And it's conceivable a lot of this could be simplified by tokenizing the text first, but I think these historical transcriptions have too many corner cases for a simple tokenizer. Plus, I really do want to see a cleaned version of the original text, before I start atomizing it for natural language processing (NLP).

If you want to modify this notebook for your own use, there are likely several things you'll need to do:
1. Delete undesired edits from the code
2. Add your own types of errors to the code
3. Curate your own lexica, with substitutions that are common in your corpus.

## Why Python?

You can make some of this notebook's edits with MS Word's find-and-replace feature. You can make more of them with regular expressions and a text editor like Notepad++, TextWrangler, or BBEdit. But using a program like Python will allow you make all these changes more quickly, more flexibly, will create an audit trail which you can refer to later on, and can be rerun on any number of files. It will also allow you to then seamlessly manipulate and analyze your data with dozens of other Python tools, using the same basic syntax, even the same notebook. With thousands of freely-available Python libraries, you can do just about anything you'd like with your text.

# On to the code!

# Import libraries

Step one: load into memory all of the Python libraries you'll be using.

In [1]:
import sys
import re
import csv
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.text import Text

Check which version of Python you are using. This code was made in 3+, and won't be compatible, for example, with Python 2.7.

In [2]:
print(sys.version)

3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


Check which version of NLTK is being used.

In [3]:
print(nltk.__version__)

3.3


# Change following settings before running

Your should set the following for each document, before running the notebook.

## Set basepath

These are used as short references to the directories/paths you read to/write from. Assigning these variables here means I can refer to them later with the short variable name, rather than the long path. Also, since I use both Mac and PC and since the paths are different on my machines, I have both set up so I can switch back and forth. You should  change them to your own computer paths.

You can either type out the paths manually, or do a web search to figure out a way to copy and paste:
1. On Windows, you can find the fullpath in the properties.
2. You can create an Automator script on the Mac to do the same thing, but there are other options.

In [4]:
macrootpath = r'YOUR PATH HERE'

Directory for this project. For example, I have a separate directory for each document that I run this code over, so all the files dealing with the Duke of Rohan's *Compleat Captain* is located in the `/rohan_compleatcaptain/` directory.

In [5]:
#docpath = 'dauvergne1694'

Define the paths

In [None]:
#mac
textpath = macrootpath + '/data/raw/'
processedpath = macrootpath + '/processed/'
lexicapath = macrootpath + '/lexica/'
outputpath = macrootpath + '/output/'
#win
# textpath = winrootpath + docpath + 'data\raw'
# processedpath = rootpath + docpath + '\processed\'

Test the above to make sure it returns a valid path

In [7]:
textpath

'/Users/jamelostwald/Dropbox/JupyterNotebooks/projects/text_clean/dauvergne1694/data/raw/'

## Name document

Enter the filename as it appears in your directory.

In [8]:
textfilename = "1640 Rohan Compleat Captain.txt"

Assign a name for your document to programmatically name output files at the end of the notebook.

In [9]:
filename = 'rohan1640'

## Set flag to read in manual lexica

Some of the errors in the document may require manual correction on your part. This occurs when there is no consistent single correction for a given error, e.g. a character might need to be corrected to several other possible characters. In such cases, the only way to tell which character to correct it to is to look at the context around the error and using your pattern-recognizing brain to figure out the correction. 

The notebook will save such potential errors to separate `csv` files, which you can then copy as `...list1`, and fix the entries as needed. When you want to incorporate these corrections into the `text`, set the `newdoc` flag below to `N` - the code will then use that test to load in the corrected `list1` and subsitute them.

When you first run this code on a document, keep `newdoc` set to `Y`.

In [10]:
newdoc = 'Y'

# Read in document to clean

The below code reads in the text document as an object called `text`. This original `text` will get passed from one section of code to the next, each section making (cumulative) changes to it.

If there is an encoding issue (e.g. at the beginning of the printout below, you might find some weird characters in the output, or you get a `codec invalid start byte` error), resave your original text file in a text editor as `UTF-8` (**without** BOM, if you're using a Mac).

In [11]:
with open(textpath + textfilename, 'r',encoding='UTF-8') as f:
    text = f.read()
text

"The history of the campagne in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\n\nBy EDWARD D'AUVERGNE, M. A. Rector of St. Brelade, in the Isle of JERSEY, and Chaplain to Their Majesties Regiment of Scots Guards.\n\nLONDON, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\n\n\nImprimatur,\n\nNovemb. 20. 1694.\nEDWARD COOKE.\n[page]\nTo the Honourable MAJOR-GENERAL RAMSAY, Colonel of Their Majesties Regiment of Scots Guards, &c.\n\nSIR,\nI Need not make an Apology for Pre∣senting the Account of the Last Cam∣pagne to You; for since Custom will have every Trifle that is publish'd, at∣tended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this Oc∣ccasion to acknowledge to the World the many Obligations I have to You: Though, to acquit my self of it, I must put your Ho∣nourable Name to a Piece in which I am sen∣sible You 

How many *characters* does the above text object have?

In [12]:
len(text)

206079

# Clean the document

Now we begin to check the document for problems, correct them, and keep track of the changes we make.

How do you know what to correct? You could start by comparing every word in your text against a dictionary. But this might give you a LOT of "errors", many of which wouldn't actually be errors, but just proper nouns or domain-specific vocabulary that are not in your word dictionary. These could include valid terms like, in my case, the parts of fortifications, obscure military terminology, and so on.

So it's probably better to start by skimming through the text and looking for obvious errors. Once you've cleaned a bunch of them programmatically, you can then catch the less common errors by comparing the tokens with a regular dictionary and a lexicon of proper nouns. If you're serious about this, you'll want to create your own collection lexica for your specific domain, particularly people's names, places (toponyms), events, groups, concepts, etc.

Below I've created a whole bunch of likely errors, likely for the types of documents I want to clean. Your mileage may vary.

## Create `edits` variable to track order of changes

This following cell creates an `edits` variable that will keep track of which edits you perform, in what order. This `edits` string will be added to the end of your final filename in the output stage, e.g. it might name the file `rohan1640_clean_nb1_aqdfƲo&*q$cp_'vo-hcesh3cfp r2nlpnc_nb1.txt`. This suffix allows an easy way to keep track of which version of your cleaned text you are working with at any point in time. This is important for documenting your method and for replicability, since the order of cleaning steps might affect your results. 

See: Denny, Matthew, and Arthur Spirling. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, September 27, 2017. https://papers.ssrn.com/abstract=2849145.


But if you don't want that, you can delete that code.

In [13]:
edits = ''

## Create a copy of your original text

Generally, the same copy of `text` will be passed on to the next section, so that the changes are cumulative. If you ever want to compare your corrected version with the original, you can simply reload the cell that reads in the original.

At the end of each section, i.e. after each main edit, you can preserve your edited text at that point in the code with a cell `text_edittypeX = text`. Thus, you can roll back changes in the next section of code (if needed) by reversing the order: `text = text_edittypeX`. Easier than having to rerun the entire notebook every time you make too many changes, if you're troubleshooting some of your code.

To view the text with its revisions up to that point, just type:

In [14]:
print(text) # or just 'text'

The history of the campagne in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.

By EDWARD D'AUVERGNE, M. A. Rector of St. Brelade, in the Isle of JERSEY, and Chaplain to Their Majesties Regiment of Scots Guards.

LONDON, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.


Imprimatur,

Novemb. 20. 1694.
EDWARD COOKE.
[page]
To the Honourable MAJOR-GENERAL RAMSAY, Colonel of Their Majesties Regiment of Scots Guards, &c.

SIR,
I Need not make an Apology for Pre∣senting the Account of the Last Cam∣pagne to You; for since Custom will have every Trifle that is publish'd, at∣tended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this Oc∣ccasion to acknowledge to the World the many Obligations I have to You: Though, to acquit my self of it, I must put your Ho∣nourable Name to a Piece in which I am sen∣sible You must find a great

# Brute-force technique

If you didn't care about the details, you could use the following code to 'clean' the text: get rid of everything that isn't a letter, number, space, or basic punctuation, and display the results. Some preprocessing online tutorials even suggest this.

In [15]:
pattern = re.compile(r'[^a-zA-Z\d\s\,\.;]') # everything that isn't in the bracketed regex
text2 = re.sub(pattern,' ',text)
print(text2)

The history of the campagne in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D Auvergne, Edward, 1660 1737.

By EDWARD D AUVERGNE, M. A. Rector of St. Brelade, in the Isle of JERSEY, and Chaplain to Their Majesties Regiment of Scots Guards.

LONDON, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple Barr, in Fleet street, 1694.


Imprimatur,

Novemb. 20. 1694.
EDWARD COOKE.
 page 
To the Honourable MAJOR GENERAL RAMSAY, Colonel of Their Majesties Regiment of Scots Guards,  c.

SIR,
I Need not make an Apology for Pre senting the Account of the Last Cam pagne to You; for since Custom will have every Trifle that is publish d, at tended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this Oc ccasion to acknowledge to the World the many Obligations I have to You  Though, to acquit my self of it, I must put your Ho nourable Name to a Piece in which I am sen sible You must find a great

But we don't want this - too many new problems, and it's now too difficult to figure out what corrections we *should* have made. Instead, we can use a more detailed process to identify specific problems and then correct them intelligently. It will require much more code, but it is largely automated, with a much more acceptable end result.

# Functions

Here are a series of functions, which we'll use to fix most of the errors. (Some of these might not actually be used in the current version of the notebook.)

If you put functions at beginning of the notebook, make sure to define all the objects referenced by those functions, e.g. `text`, various change dicts, etc.

## `lexiconreplaceassign` function

One of the main ways to fix errors is to read into memory a pre-curated list of potential errors, check the `text` for any of these strings, and if it finds any of them, correct that error and add both the error and the correction to the `changesdict`, for audit purposes.

First, I define a function that takes three variables: `old_text`, `new_text`, and `active_text`. When that function is called, it will take the three variable values you provide and replace them in the replacement, and also add them to the specified change dict.

Note that this creates a nested dict, with a nested subdict for each type of error. This will require you to use a bit of code to parse different levels of the dict. In general, the `changesdict` looks like this, with each type of error (`vv`, `hyphen`...), followed by a nested dict indicating which specific changes were made:

    {'vv': {'this VVork, at': 'this work, at',
           'the VVork well,': 'the work well,'},
    'weirdo': {},
    'hyphen': {'a-breast': 'abreast',
               'musket-fire': 'musketfire'}
    }

Function to take a lexicon, replace the errors for the correction listed in the lexicon, add those error-correction pairs to the `changesdict`, and then assign the resulting string as `text`.

In [16]:
def lexiconreplaceassign(edittype,outerdict,lexicon,active_text):
    for old, new in lexicon.items():
        if old in active_text: #keep for pre-curated lexica
            active_text = active_text.replace(old,new) #put this before dict assignment 
            outerdict[edittype][old] = new
    return active_text

## `addextra` function

General `addextra` function: replace the errors in `text` (listed in a separate dict) with substitutions specified as an argument in the function call, and add each error and correction to the `changesdict`. 

In [17]:
def addextra(edittype,outerdict,old,replacement,active_text):                      
    if old in active_text:
        outerdict[edittype][old] = {}
        outerdict[edittype][old] = replacement
    active_text = active_text.replace(old,replacement) 
    return active_text

A variation on the above, if you want to group the errors together by the type of correction to be made.

In [18]:
def addextraenum(edittype,outerdict,old,replacement,active_text,num_list):                      
    for i,val in enumerate(squares): #NB not using squares1, because we only need its index number
        for j in num_list:
            if j == i:
                outerdict[edittype][val] = {}
                outerdict[edittype][val] = val.replace(old,replacement)
                active_text = active_text.replace(val,val.replace(old,replacement))
    return active_text

## `textreplacereplace` function

This is a simple function that takes the error and the substitution, replaces the error in the `text` and adds them to the appropriate nested dict.

In [19]:
def textreplacereplace(edittype,outerdict,old,replacement,active_text):
    if old in active_text:
        outerdict[edittype][old] = replacement
        active_text = active_text.replace(old,replacement)
    return active_text

## `textreplaceassign` function

If you need to use the `str.replace` method and assign the dict value

NB:
1. This is brute force, not regex. We create a list (regex), then in that list do text replace for text and changedict.
2. You don't want to create an empty nested dict here, in case you need to run the code several times

In [20]:
def textreplaceassign(edittype,outerdict,old_text,new_text,active_text):  
    if old_text in active_text:
        #outerdict[edittype] = {} #create an empty outerdict; assumes each edittype only uses one function
        outerdict[edittype]['old_text'] = {} #create empty innerdict for old value
        outerdict[edittype]['old_text'] = old_text #assign old value
        outerdict[edittype]['new_text'] = {}
        outerdict[edittype]['new_text'] = new_text
        active_text = active_text.replace(old_text,new_text)
    return active_text

# Start making changes to the text and save them in the `changesdict`

# Delete front-matter

If you are focusing on the content of a book, you might want to delete the front matter: title page, publisher info, etc. For example, if you want to extract the toponyms (place names) from the text, you might not want the place of publication to be included in that.  Or maybe you don't want the name of the person to whom the book was dedicated in your list of historical people mentioned in the narrative. It's also worth checking at the end as well, especially for OCRed text boilerplate.

# General procedure

Generally, if the goal is to end up with a faithfully rendered version of the text that retains all the important features of the original (important defined by your purpose), we need to balance our desire to automate the corrections as much as possible with the need to not overcorrect. Particularly since we don't always know the exact problems that need fixing, nor the correct corrections. That means we're going to make our changes targeting the most-certain errors and the most-certain correction, which usually means starting with the most precisely-targeted errors and then get more broad. At the end, we can do one last look through the possible errors, and change them individually.

The order in which you perform these various corrections may make a difference. Especially challenging is that some words with other errors might need one of the errors to be corrected before the other errors will become evident to our code, e.g. `com- i * pound` might need to have the `i *` deleted (created from stray marks in the right margin of  line 1 and in the left margin of line 2) before the code will know to rejoin the `com-` and `pound` into `compound`. We humans can recognize that problem right away, but there might be hundreds of such errors in a single document, and we don't want to have to correct them all by hand.

With some of the errors, e.g. invalid charaters, you *could* create regex patterns to find each character, then use a string replace method to replace every instance of the invalid character with its valid replacement. However, we'd like to keep an audit trail of which changes we made, as well as the entire word in which such an invalid character appeared. This would allow us to make sure the change is legit. So instead of doing a string replace method on the `text`, we'll use regex to identify invalid characters and the word that they belong in, then create a substitution dict that our `lexcionreplaceassign` will work through, adding each change to the `changesdict` audit trail. That way, we'll make the substitutions in the `text` object and have a running audit trail of which words were changed to what.

# Create `changesdict` for auditing

In order to create an audit trail of the corrections made, we need to create a separate dict to store the old and new values. You could create a separate change dict for each type of change, but instead we'll have a nested dict, so that we have a running tally. Here I make it a plain dict, but for each type of correction, I will create a nested dict inside it.

NB: The `edits` string created above is distinct from this change dict; `edits` is only to keep track of the order of the changes and append that string to the filename at the end of the program. That way if the edited text file and the changesdict file get separated, you'll always know the order of edit within the name of the text file.

Create the empty dict

In [21]:
changesdict = {}

# Tokenize on space only

One of the ways to clean the `text` is to tokenize it, to split each element of the string into tokens. We often think of tokens as individual words, but this will depend on which delimiters you use to tokenize.

To standardize the punctuation in the text, we'll need to split our text into tokens. But if you use the default NLTK tokenizer, that uses punctuation as well as whitespace as delimiters. Instead, we'll use the regex `split` function and only split on whitespaces. You wouldn't want to use this tokenize method for most NLP uses, but if all you're looking for is punctuation and some characters around it for context, it'll work.

In [22]:
wordspace = text.split()
wordspace

['The',
 'history',
 'of',
 'the',
 'campagne',
 'in',
 'the',
 'Spanish',
 'Netherlands,',
 'Anno',
 'Dom.',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne,",
 'Edward,',
 '1660-1737.',
 'By',
 'EDWARD',
 "D'AUVERGNE,",
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade,',
 'in',
 'the',
 'Isle',
 'of',
 'JERSEY,',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesties',
 'Regiment',
 'of',
 'Scots',
 'Guards.',
 'LONDON,',
 'Printed',
 'for',
 'Matt.',
 'Wotton,',
 'at',
 'the',
 'Three',
 'Daggers;',
 'and',
 'John',
 'Newton,',
 'at',
 'the',
 'Three',
 'Pigeons,',
 'near',
 'Temple-Barr,',
 'in',
 'Fleet-street,',
 '1694.',
 'Imprimatur,',
 'Novemb.',
 '20.',
 '1694.',
 'EDWARD',
 'COOKE.',
 '[page]',
 'To',
 'the',
 'Honourable',
 'MAJOR-GENERAL',
 'RAMSAY,',
 'Colonel',
 'of',
 'Their',
 'Majesties',
 'Regiment',
 'of',
 'Scots',
 'Guards,',
 '&c.',
 'SIR,',
 'I',
 'Need',
 'not',
 'make',
 'an',
 'Apology',
 'for',
 'Pre∣senting',
 'the',
 

# Standardize Apostrophes

Some punctuation marks have a suprising number of variations. If we're using OCRed text, it's possible that the OCR converted them into different versions, such that it thinks there might be more than one type of "apostrophe"; another option is that the original source used different types, and the transcriber/OCR retained that distinction. 

But I don't like that. You can't imagine how frustrating it is trying to figure out why your simple code isn't working, only to find that the problem is actually that you are searching for the wrong "kind" of apostrophe!

If we want to correct any strings that have apostrophe variants, we should first convert them all to a single type, the standard straight apostrophe - `'`. Then, when we do any searches or replacements, we know that we will get every one of them, rather than worry about which kind of apostrophe we're looking at, or running the same correction across each type of apostrophe. If we want, we can then convert known apostrophes, say, possessives, to one of the other flavors of apostrophe, which would allow us to mass update the straight apostrophe without fear of losing possessives.

Searching for an apostrophe in a string, however, is a bit more complicated due to Python's syntax. Since we often use the apostrophe to indicate the beginning and end of a string, that can be confused with the apostrophe as a real character. So we have two options if we want an apostrophe within a string. The first is to use quotation marks - `" "` - to indicate the string boundaries; Python doesn't care if the string boundaries are single or double quotes. Alternately, if we use regex, we need to add an extra character. Some characters have special meaning in regex, which means you need to "escape" them to let regex know that they are intended as literal characters. You can "escape" a literal character by putting a backslash (`\`) in front of it. In this case, it would look like `'\''`. Or, you could just do `"'"`.

In later sections, we'll be able to automate the corrections: given an string X, we substitute Y for it. But these apostrophes are tricky. We'll want to keep some of them, e.g. the possessives like `Majesty's`, and otherse like elisions: `D'Auvergne`. Some, however, are archaic syncopated contractions of the past tense of verbs (`express'd`) that we'll want to change. We'll deal with them separate in later sections.

Make a new list, `apost`, for all words with apostrophe variations, based off `wordspace` token list.

In [23]:
apost = [] #creates an empty list
for i in wordspace: #beginning of for each item loop
    if re.findall('\'',i) or re.findall('’',i) or re.findall('`',i) or re.findall('‘',i): #if loop using regex to findall all occurences of x or y...
        apost.append(i) #append each item matching above if criteria to the apost list
apost #print apost list items

["D'Auvergne,",
 "D'AUVERGNE,",
 "publish'd,",
 "'tis",
 "encreas'd",
 "Re∣gain'd",
 "Repuls'd",
 "express'd",
 "KING's",
 "Train'd",
 "'tis",
 "Rais'd",
 "Majesty's",
 "Majesty's",
 "D'Auvergne.",
 "forc'd",
 "Copy'd,",
 "Tran∣scrib'd",
 "Copy'd",
 "us'd",
 "impos'd",
 "concern'd",
 "Profess'd",
 "Year's",
 "flush'd",
 "oblig'd",
 "push'd",
 "supply'd",
 "weaken'd",
 "Bouillon's",
 "Cologne's",
 "'tis",
 "King's",
 "look'd",
 "advanc'd",
 "open'd",
 "long'd",
 "quarter'd",
 "call'd",
 "Ingoldsby's",
 "dispos'd",
 "consider'd,",
 "oblig'd",
 "Conquer'd",
 "believ'd;",
 "us'd",
 "smother'd",
 "King's",
 "fatigu'd",
 "weaken'd",
 "or∣der'd",
 "'twas",
 "King's",
 "turn'd",
 "King's",
 "bestow'd",
 "ow'd",
 "King's",
 "quarter'd",
 "march'd",
 "quarter'd",
 "joyn'd",
 "Lauder's",
 "Ferguson's;",
 "(Argyle's",
 "remain'd",
 "march'd",
 "march'd",
 "encamp'd",
 "march'd",
 "march'd",
 "joyn'd",
 "remain'd",
 "quarter'd",
 "joyn'd",
 "Nassau's",
 "remain'd",
 "Meloniere's",
 "encamp'd",
 "ma

In [24]:
len(apost) #how many items are in the list?

458

You'll probably notice a few other issues that need to be fixed. All in due time.

To look for the other types of apostrophes, you can also use a `list comprehension` that will put each token with the specific character into a new list, and then append them all into a single list:

In [25]:
apost1 = [s for s in wordspace if '’' in s]
apost1

[]

In [26]:
apost2 = [s for s in wordspace if '`' in s]
apost2

[]

In [27]:
apost3 = [s for s in wordspace if '‘' in s]
apost3

[]

If the above lists are empty, the output is an empty list: `[]`. This indicates that those types of apostrophes do not occur in the given text.

## Change `apost`

If there are variant apostrophes to replace, the following code will replace those other types of apostrophes with the straight apostrophe. If the other apostrophe types don't exist, there will be no underlying changes, but you should still run the code below so you'll know that you did check.

We'll use the `textreplaceassign` function, run earlier. We've created 4 arguments in the first line, and all we need to do is pass the value of each argument, making them variables, and then the code will 1) replace the old text with the new text in the text, 2) create a new dict entry for each change.

First, create the new nested `edittype` dict that changes will get inserted into, in this case:

In [28]:
changesdict['apost'] = {} #create new inner (nested) dict called apost

Now we can simply call the function by its name, add the values of the arguments for each change, and see the results.

In [29]:
aposttochange = ['’','’','‘']

Since we have three types of apostrophes to replace, we could run the function three times, or we can make a list of the apostrophes to change and then loop over that list with the function:

In [30]:
for i in aposttochange:
    textreplaceassign('apost',changesdict,i,"'",text)
changesdict

{'apost': {}}

If there were no changes, the output above should look like: `{'apost': {}}`.

In [31]:
edits += 'a'
edits

'a'

Checkpoint save - in case we are working on the next block of cells and want to restart from this point, rather than wait to rerun the entire notebook over again.

In [32]:
text_a = text

If you come across a problem in the next code section and want to backtrack to the end of this section, you can uncomment the line below and run it, starting fresh in the next section.

In [33]:
#text = text_a

Another tip: if you're working on your code in a long notebook, and want to run your notebook up to (just before) the point which still needs work, you can type something nonsensical into a code cell. Python will error at that cell and stop, allowing you to jump right to where you need to pick up. Personally, I chose `blah` because I couldn't think of anything else to type in the moment, and it's unique, i.e. I can search a notebook for `blah` and I'm unlikely to find a false hit.

In [34]:
#blah #uncomment the beginning of this line if you want the notebook to stop here

# Standardize Quotation marks

As with apostrophes, quotation marks also come in several varieties, e.g. `"`, `“`, and `”`. You should check to see if your text has a single variety, or multiple. You do this in the same way as you did with apostrophes. Again, this may be more important for OCRed text than for hand-entered text like the TCP.

Then, standardize all quotes (e.g. curly quotes) to regular straight quotes.

First, find any double quote marks

In [35]:
quote = []
for i in re.findall(r'\b\S* {0,1}\[“|”] {0,1}\S*\b',text): #the divisor separator inside a regex [] group means OR
    quote.append(i)
quote

[]

## Create `quotedict` and change

Standardize double quote marks. We can use the same function as above, and just change the arguments.

In [36]:
changesdict['quote'] = {}
textreplaceassign('quote',changesdict,"“","'",text)
textreplaceassign('quote',changesdict,"”","'",text)

"The history of the campagne in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\n\nBy EDWARD D'AUVERGNE, M. A. Rector of St. Brelade, in the Isle of JERSEY, and Chaplain to Their Majesties Regiment of Scots Guards.\n\nLONDON, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\n\n\nImprimatur,\n\nNovemb. 20. 1694.\nEDWARD COOKE.\n[page]\nTo the Honourable MAJOR-GENERAL RAMSAY, Colonel of Their Majesties Regiment of Scots Guards, &c.\n\nSIR,\nI Need not make an Apology for Pre∣senting the Account of the Last Cam∣pagne to You; for since Custom will have every Trifle that is publish'd, at∣tended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this Oc∣ccasion to acknowledge to the World the many Obligations I have to You: Though, to acquit my self of it, I must put your Ho∣nourable Name to a Piece in which I am sen∣sible You 

In [37]:
changesdict['quote']

{}

Add `edits` abbreviation

In [38]:
edits += 'q'
edits

'aq'

Checkpoint save.

In [39]:
text_q = text

In [40]:
#text = text_q

# Delete `∣` divisor and vertical bar `|`

EEBO TCP texts often use the `∣` character to indicate words split across lines. Delete these if you want `Ho∣nourable` to become `Honourable`.

NB1: There are different types of vertical line characters, e.g. this is a `divisor`, not to be confused with a pipe or `vertical bar`, like this: `|`. To be safe, you can add both to your regex.

NB2: The VEP version of TCP eliminated these divisors.

First we find all instances of words which include the `∣`, to look at what we're dealing with.

In [41]:
divisorlist = []
divisorlist = re.findall(r'\b\S*[|∣]\S*\b',text)
divisorlist

['Pre∣senting',
 'Cam∣pagne',
 'at∣tended',
 'Oc∣ccasion',
 'Ho∣nourable',
 'sen∣sible',
 'Mi∣stakes',
 'Ma∣nagement',
 'Obli∣ging',
 'Fa∣vours',
 'Commen∣dations',
 'Bat∣tel',
 'Con∣duct',
 'Vi∣gour',
 "Re∣gain'd",
 'Ba∣varia',
 'Ma∣king',
 'Dal∣housy',
 'Pro∣sperity',
 "Tran∣scrib'd",
 'Na∣tional',
 'wil∣ling',
 'unani∣mously',
 'Inconveni∣ences',
 'Com∣mon',
 'Coun∣trey',
 'there∣fore',
 'some∣thing',
 'Oppor∣tunity',
 'dis∣concert',
 'tho∣rough',
 'seve∣ral',
 'inso∣much',
 'Com∣pleating',
 'rela∣tion',
 'pur∣pose',
 'Re∣giments',
 'Co∣lonel',
 'Quar∣ters',
 'lamen∣table',
 'Cam∣pagne',
 'Tu∣mults',
 "or∣der'd",
 'Divertise∣ments',
 'be∣fore',
 'Regi∣ment',
 'Batta∣lions',
 'com∣posed',
 'Wirtem∣berg',
 'Artil∣lery',
 'Vil∣lages',
 'Coun∣trey',
 'Luxem∣bourg',
 'Col∣lier',
 'Vil∣vorde',
 'Den∣dermond',
 'In∣fantry',
 'Brus∣sels',
 'be∣tween',
 'Vil∣leroy',
 'Open∣ing',
 'Beth∣lehem',
 'Coun∣trey',
 'Princi∣pality',
 'Coun∣trey',
 'Vil∣lages',
 'Bri∣gadier',
 "re∣view'd",
 'Thisenha

In [42]:
len(divisorlist)

467

### Check your regex

Since regex can be quite tricky, it might be worth comparing your expected results with your actual results. If, for example, you want to find a specific character (like `∣`), but you need to find the surrounding characters for the change dict, you can do the following. Search for that character by itself, note the number of occurrences, and then compare that frequency with the number that your more complicated regex returns. For example, if the regex above (`\b\S*∣\S*\b`) is intended to find *all* divisors, the number that regex returns had better be the same as the number returned if just searching for the divisor sign itself. So let's check.

In [43]:
divis = re.findall(r'[∣|]',text)
if len(divis) != len(divisorlist): #if number is different, warn with "Try again"
    print('Try again')
else:
    print('Good job')

Good job


Since we want to delete all of the divisors, we can use our `lexiconreplaceassign` function by creating a substitution dict to send to the function.

In [44]:
divisordict = {}
for i in divisorlist:
    divisordict[i] = re.sub('[|∣]','',i)
divisordict

{'Pre∣senting': 'Presenting',
 'Cam∣pagne': 'Campagne',
 'at∣tended': 'attended',
 'Oc∣ccasion': 'Occcasion',
 'Ho∣nourable': 'Honourable',
 'sen∣sible': 'sensible',
 'Mi∣stakes': 'Mistakes',
 'Ma∣nagement': 'Management',
 'Obli∣ging': 'Obliging',
 'Fa∣vours': 'Favours',
 'Commen∣dations': 'Commendations',
 'Bat∣tel': 'Battel',
 'Con∣duct': 'Conduct',
 'Vi∣gour': 'Vigour',
 "Re∣gain'd": "Regain'd",
 'Ba∣varia': 'Bavaria',
 'Ma∣king': 'Making',
 'Dal∣housy': 'Dalhousy',
 'Pro∣sperity': 'Prosperity',
 "Tran∣scrib'd": "Transcrib'd",
 'Na∣tional': 'National',
 'wil∣ling': 'willing',
 'unani∣mously': 'unanimously',
 'Inconveni∣ences': 'Inconveniences',
 'Com∣mon': 'Common',
 'Coun∣trey': 'Countrey',
 'there∣fore': 'therefore',
 'some∣thing': 'something',
 'Oppor∣tunity': 'Opportunity',
 'dis∣concert': 'disconcert',
 'tho∣rough': 'thorough',
 'seve∣ral': 'several',
 'inso∣much': 'insomuch',
 'Com∣pleating': 'Compleating',
 'rela∣tion': 'relation',
 'pur∣pose': 'purpose',
 'Re∣giments': 'Regi

Check to makes sure it worked.

## Create `divisordict` and change

Now that we are assured of our regex, we can make the changes.

Make subdict

In [45]:
changesdict['divisor'] = {}

Replace

In [46]:
text = lexiconreplaceassign('divisor',changesdict,divisordict,text)
changesdict

{'apost': {},
 'quote': {},
 'divisor': {'Pre∣senting': 'Presenting',
  'Cam∣pagne': 'Campagne',
  'at∣tended': 'attended',
  'Oc∣ccasion': 'Occcasion',
  'Ho∣nourable': 'Honourable',
  'sen∣sible': 'sensible',
  'Mi∣stakes': 'Mistakes',
  'Ma∣nagement': 'Management',
  'Obli∣ging': 'Obliging',
  'Fa∣vours': 'Favours',
  'Commen∣dations': 'Commendations',
  'Bat∣tel': 'Battel',
  'Con∣duct': 'Conduct',
  'Vi∣gour': 'Vigour',
  "Re∣gain'd": "Regain'd",
  'Ba∣varia': 'Bavaria',
  'Ma∣king': 'Making',
  'Dal∣housy': 'Dalhousy',
  'Pro∣sperity': 'Prosperity',
  "Tran∣scrib'd": "Transcrib'd",
  'Na∣tional': 'National',
  'wil∣ling': 'willing',
  'unani∣mously': 'unanimously',
  'Inconveni∣ences': 'Inconveniences',
  'Com∣mon': 'Common',
  'Coun∣trey': 'Countrey',
  'there∣fore': 'therefore',
  'some∣thing': 'something',
  'Oppor∣tunity': 'Opportunity',
  'dis∣concert': 'disconcert',
  'tho∣rough': 'thorough',
  'seve∣ral': 'several',
  'inso∣much': 'insomuch',
  'Com∣pleating': 'Compleating

Remember that the dict will only contain *unique* keys (i.e. you overwrite the value when the same key is entered a second time), so we shouldn't necessarily assume its length to be equal to `divisorlist` above.

If you wanted to check whether all the divisors were deleted, you can search for it in the `text`

In [47]:
re.findall(r'\b\S*∣\S*\b',text)

[]

In [48]:
edits += 'd'

Check your results - if the character is still present, this will create an error that will exit the code and warn you. Useful if you are running, rather than walking, through code.

In [49]:
text_div = text

In [50]:
#text = text_div

# Replace `ſ` long-s character

Many 17C-18C works used a `ſ` ("long-s") character to substitute for an `s`. The "rules" were vague and fluid, as this blog post suggests: http://babelstone.blogspot.com/2006/06/rules-for-long-s.html. And I've found exceptions to even those rules, so all bets are off.

To allow us to use natural language processing tools, we need to replace them with a normal `s`. VEP already converted them; you'll also come across them in Google Books. Here, we will convert the `ſ` to an `f`, instead of an `s`, on the off-chance that a few might actually be `f` instead of `s`. Don't worry - in a later section of the notebook, we'll read in a list of common long-s words where the `ſ` character has been recorded as an `f`.

In [51]:
longfslist = []
longfslist = re.findall(r'\b\S*ſ\S*\b',text)
longfslist

[]

Make dict for lexicon

In [52]:
longfsdict = {}
for i in longfslist:
    longfsdict[i] = i.replace('ſ','f')
longfsdict

{}

## Create `longfsdict` and change

In [53]:
changesdict['longf'] = {}

In [54]:
text = lexiconreplaceassign('longf',changesdict,longfsdict,text)
changesdict['longf'] #Now that dict getting long, only displaying the current subdict

{}

In [55]:
edits += 'f'
text_f = text

In [56]:
if 'ſ' in text:
    blah # this is just something to create an error

# Replace `Ʋ` character

Use the same basic code for other single-character edits:
1. Create a list of the tokens to be changed.
2. Figure out what the correct character should be. In this case, `Ʋ` is a `u`.
3. Add the old and new tokens to a change dict.
4. Use the `replace` method on the entire text.

If you want to know what a specific character is called, you can just paste that character into a Google search. But you already knew it is a labiodental approximant, didn't you?

VEP deleted these

NB: There are both upper and lower case versions, and they might have been modified with the case change above. `Ʋʋ`

Use regex to find all occurrences of `Ʋ` or `ʋ`

In [57]:
labapplist = []
labapplist = re.findall(r'\b\S*[Ʋʋ]\S*\b',text)
labapplist

['Ʋnited',
 'BRƲGES',
 'DIXMƲYDE',
 'AƲDENARDE',
 'HAGƲE',
 'BOISLEDƲC',
 'BRƲGES',
 'GERTRƲDENBERG']

In [58]:
labappdict = {}
for i in labapplist:
    if 'Ʋ' in i:
        labappdict[i] = i.replace('Ʋ','U')
    else:
        labappdict[i] = i.replace('ʋ','u')
labappdict

{'Ʋnited': 'United',
 'BRƲGES': 'BRUGES',
 'DIXMƲYDE': 'DIXMUYDE',
 'AƲDENARDE': 'AUDENARDE',
 'HAGƲE': 'HAGUE',
 'BOISLEDƲC': 'BOISLEDUC',
 'GERTRƲDENBERG': 'GERTRUDENBERG'}

## Create `labappdict` and change

In [59]:
changesdict['labapp'] = {}

Call function. Note that we treat lower and uppercase separately.

In [60]:
text = lexiconreplaceassign('labapp',changesdict,labappdict,text)
changesdict['labapp']

{'Ʋnited': 'United',
 'BRƲGES': 'BRUGES',
 'DIXMƲYDE': 'DIXMUYDE',
 'AƲDENARDE': 'AUDENARDE',
 'HAGƲE': 'HAGUE',
 'BOISLEDƲC': 'BOISLEDUC',
 'GERTRƲDENBERG': 'GERTRUDENBERG'}

In [61]:
'Ʋnited' in text

False

In [62]:
edits += 'Ʋ'
text_u = text

# Replace `°` character

It's possible that OCR may have misread the letter `o` as a degree symbol, `°`, or maybe even data entry clerks did it. We should check.

In [63]:
degreelist = []
degreelist = re.findall(r'[a-zA-Z]+°[a-zA-Z]+',text)
degreelist

[]

Note the importance of regex here. Just in case you have legitimate `°` signs in your text, the regex above says only to find those that are surrounded by other alphabetical characters. That means that `34° F` will not be flagged; of course, it also means that it would miss something like `hell° there` or `an °utlaw`. Welcome to the hell of regex edge cases.

In [64]:
degreedict = {}
for i in degreelist:
    degreedict[i] = i.replace('°','o')
degreedict

{}

## Create `degreedict` and change

We can change any degree signs to `o`'s, though we'll want to make sure first that they are warranted.

In [65]:
changesdict['degree'] = {}

In [66]:
text = lexiconreplaceassign('degree',changesdict,degreedict,text)
changesdict['degree']

{}

In [67]:
edits += 'o'
text_o = text

# Replace other weird characters

If the divisor, labiodental approximate and degree symbol have made you a bit paranoid, they should have. To see what's really out there lurking in our text, we can check for any 'weird' characters that we might need to catch and domesticate. We'll do this with regex. 

"Weird" and "not-weird" is all relative, of course, to which language you're using, and what types of documents are in your corpus. A modern historical biography will probably not have many Greek characters, whereas an early modern history might; and we won't even mention early modern alchemical texts... 

Technically, this is often an issue of charater encoding, particularly a result of the expansion from the earlier ASCII character set (128 possible characters, those most common in English, the main language for early computer programming) to a much much broader character encoding system called Unicode. Search online for details, but this is what allows us to have non-Latin (i.e. non-English) alphabets, as well as millions of different emojis. 
You can decide, for your particular use case, which characters you want to include and exclude by crafting your regex accordingly.

But instead of trying to imagine every possible character that might appear, we'll exclude what we know we *want* from our search. In my case, I want to exclude all of the standard English-language characters listed inside the brackets. The result will display any other characters.

In [68]:
weird = re.findall(r'[^a-zA-Z\d\s\/\[\\(\);:\',.!?\\-\]]',text)
set(weird)

{'&', '*', '-', 'ç', 'è', 'é', '̄', '—', '•', '▪', '〈', '〉'}

To see all those characters in context, use a slightly-expanded regex

In [69]:
weirdcontext = re.findall(r'\b\S*[^a-zA-Z\d\s\/\[\\(\);:\',.!?\\-\]]\S*\b',text)
set(weirdcontext)

{'1660-1737',
 'André',
 'Artillery-Horses',
 'Battering-Pieces',
 'Bernstorf-Zell',
 'Blood-Royal',
 'Bomb-Ketches',
 'Bread-Waggons',
 'Breast-works',
 'Burgh-Masters',
 'Coal-pits',
 'Cohorné',
 'Colonel-General',
 'Commissary-General',
 'Condè',
 'Condé',
 'Cossé',
 'Court-Marshal',
 'Day-time',
 'Dompré',
 'Draw-Bridge',
 'Drinking-Money',
 'Earth-work',
 'Field-Officer',
 'Field-Pieces',
 'Field-pieces',
 'Fimarçon',
 'Fire-Armes',
 'Fitz-Patrick',
 'Fleet-street',
 'Flower-de-luces',
 'Françoises',
 'Fren̄ch',
 'Frosty-weather',
 'Gassé',
 'Great-Britain',
 'Guards-Room',
 'Head-Quarter',
 'Head-dress',
 'Holstein-Beck',
 'Holstein-Norbourg',
 'Holstein-Norburg',
 'Holstein-Ploen',
 'Holstein-Ploens',
 'Horn-work',
 'Horse-Granadiers',
 'Horse-back',
 'Horse-men',
 'H•a•s',
 'Kings-Armes',
 'Liege-Dragoons',
 'Lieutenant-Colonel',
 'Lieutenant-Colonels',
 'Lieutenant-General',
 'Lieutenant-Generals',
 'Life-G',
 'Life-Gua',
 'Life-Guard',
 'Life-Guards',
 'Life-Regiment',
 'Life

Analyzing the results above, we see a few more odd characters that most likely should be something else. Going through the list, we'll need to:
1. Decide whether we're ok with having accented characters, or whether we want to 'flatten' them into the ASCII equivalent, e.g. `é` and `è` and `ë` all become `e`.
2. Decide what to do about the dashes. Note that there are *multiple* types of dashes - do we want to retain them, or just flatten them into one standard dash? I say flatten.
3. Explore a few other characters further, like this weird one: `'̄`. You can look up the name and description of any character by simply selecting it and pasting it into a Google search. 

So if we were running this on D'Auvergne's campaign narrative of 1694, we'd need to check characters like the ampersand `&`, the asterisk `*`, a macron thingie `'̄`, the circle `•`, the square `▪`, and those angled brackets `〈`...

Ideally, if we wanted to do the exact same type of correction for all of the above characters, we could make a single bit of code that would walk through each: "Replace each weird character with whatever in `text` and save the old and new to `changesdict`." But, unfortunately, we don't always know what to do with each weird character, and sometimes we need to do different things for a specific weird character based on what letters are around it. This means we'll need to correct these characters one type at a time.

You can create new sections of the code below for any additional characters that appear in your source.

# Replace `&` ampersand

For a start, we might want to replace the `&` - the most obvious solution would be to repalce it with its full version: `and`. But before we do that, however, we should check to make sure that we don't have a `&` for some other reason. For example, some typefaces have a `ct` ligature, which looks an awful lot like `&t` to a computer. So we should check for that *before* we do a batch update of `&`. As this example illustrates, it helps to know your text before you start blindly making changes to it.

In [70]:
re.findall(r'\b\S*&\S*',text)

[]

Look for `&t` ligature

In [71]:
ampdict = {}
for i in re.findall(r'\S*&[aeiou]\S*',text):
    ampdict[i] = i.replace('&','ct')
ampdict

{}

Find `&` as `etc.`

In [72]:
for i in re.findall(r'&c',text):
    ampdict[i] = i.replace('&c','etc')
ampdict

{'&c': 'etc'}

Create the empty subdict

In [73]:
changesdict['ampersand'] = {}

Call the `lexiconreplaceassign` function to make the substitution in `text` and add them to `changesdict`

In [74]:
text = lexiconreplaceassign('ampersand',changesdict,ampdict,text)
changesdict['ampersand']

{'&c': 'etc'}

Assuming you didn't find any `&t` ligatures in your document, we *could* go on to changing all those `&` to `and`. But maybe we should make sure by using a broader regex:

In [75]:
re.findall(r'\S*&\S*',text) #\S* means 'zero or more non-whitespace characters'. Narrower would be '[a-zA-Z]&[a-z]'

['&']

Now we can check for any remaining `&`. If we include padded spaces on either side, we'll find any `&` by themselves, which presumably means they should be `and`.

In [76]:
re.findall(r' & ',text)

[' & ']

In [77]:
ampdict1 = {} #NB: instead of a new ampdict1, we could add this to the previous ampdict and then call the function after
for i in re.findall(r' & ',text): #NB: here we skipped creating the amplist and just put the regex in our for loop, since findall returns a list
    ampdict1[i] = i.replace(' & ',' and ')
ampdict1

{' & ': ' and '}

In [78]:
text = lexiconreplaceassign('ampersand',changesdict,ampdict1,text)
changesdict['ampersand']

{'&c': 'etc', ' & ': ' and '}

In [79]:
edits += '&'
text_a = text

# `*` Asterisk

With poor-quality OCR, some characters or stray marks may be interpreted as `*`. Check to see if those need correction.

Find all `*` that are connected to a word. Note that since the `*` character is also a special class in regex, i.e. it's shorthand for "find the previous character zero or more times", we need to 'escape' it, by adding a slash `\` in front of it.

In [80]:
re.findall(r'\S*\*\S*',text)

['*here', '*Boors', '[*Note,']

We could expand the regex and see a bit more context around these asterisk words.

In [81]:
re.findall(r'\S* \S*\*\S* \S*',text) # Each <space>\S*<space> adds another word of context: 'any non-whitespace, zero or more times'

['Report *here last', 'The *Boors had']

Examining the results, we see that sometimes we want to delete the `*`, other times we might want to replace it with a space. We might even want to retain the `*` if it's being used as a footnote marker. To keep it simple, I'll use a conditional `if` statement: either replace the `*` with nothing (`''`) if it's followed by a space, otherwise replace the `*` with a space (`' '`).

Create replacement dict

In [82]:
astdict = {}
for i in re.findall(r'\S* \S*\*\S* \S*',text):
    if re.search(r'\S* \S*\* \S*',i): # if * followed by space
        astdict[i] = i.replace('*','')
    else: # could also make elif with regex, to be more targeted
        astdict[i] = i.replace('*',' ')
astdict

{'Report *here last': 'Report  here last', 'The *Boors had': 'The  Boors had'}

In [83]:
len(astdict)

2

In [84]:
changesdict['asterisk'] = {}

In [85]:
text = lexiconreplaceassign('asterisk',changesdict,astdict,text)
changesdict['asterisk']

{'Report *here last': 'Report  here last', 'The *Boors had': 'The  Boors had'}

After we've made the above corrections, we can craft a broader regex to search for any other `*` we might have missed - our regex above required having spaces and non-whitespace characters on either side.

In [86]:
re.findall(r'\S*\*\S*',text)

['[*Note,']

We could change these generically: replace the remaining `*` with a space. Extra spaces aren't a major concern. If desired, we can convert all multiple-space sequences to a single space later in the code.

In [87]:
astdict1 = {}
for i in re.findall(r'\S*\*\S*',text):
    astdict1[i] = i.replace('*',' ')
astdict1

{'[*Note,': '[ Note,'}

In [88]:
text = lexiconreplaceassign('asterisk',changesdict,astdict1,text)
changesdict['asterisk']

{'Report *here last': 'Report  here last',
 'The *Boors had': 'The  Boors had',
 '[*Note,': '[ Note,'}

In [89]:
edits += '*'
text_a = text

# Replace `?`

Sometimes the question mark is a valid choice to end a sentence - or is it? Other times, it might be used by a data entry person to indicate an uncertain character. In the EEBO TCP, some of the documents have many question marks that we'd like to get rid of. Let's find out.

First, let's just get a quick count of all the `?`s that appear in `text`.

In [90]:
len(re.findall(r'\?',text))

4

Of course, we need to distinguish between legitimate sentence-ending `?`s vs. those in the middle of a sentence. So the following regex will only ask for a string with a `?` if it is *not* followed by a space or a capital letter.

In [91]:
quest = []
for i in re.findall(r'\S*\?[^ A-Z] \S*',text):
    quest.append(i)
quest

[]

In [92]:
len(quest)

0

Let's just make sure our regex is working:

In [93]:
re.findall(r'\S*\?\S* \S*',text)

['pleases? I', 'late? If', 'it? If', 'Foundation? We']

Create replacement dict.

In [94]:
questdict = {}
for i in quest:
    questdict[i] = i.replace('?','')
questdict

{}

In [95]:
changesdict['quest'] = {}

In [96]:
text = lexiconreplaceassign('quest',changesdict,questdict,text)
changesdict['quest']

{}

Alternatively, you could see if there are different types of `?` errors requiring different types of corrections. Then you'd save the above list to a separate file, and edit them by hand.

In [97]:
edits += 'q'
text_q = text

# Replace `$` with `s`

Obviously dollar signs should only be deleted for particular corpora, depending on your geographical and chronological focus.

If you think there might be many, you can start with a more precise search, worried that you'll need to make different changes, depending on the context.

In [98]:
re.findall(r'\b\S*\$\S*\b',text) #NB need to escape $ in regex

[]

But if none appear in your narrow search, or if you want to start with a broader search:

In [99]:
re.findall(r'\$',text)

[]

Create a lexicon dict and add any changes to the `changesdict`

In [100]:
dollardict = {}
for i in re.findall(r'\b\S*\$\S*\b',text):
    dollardict[i] = i.replace('$','')
dollardict

{}

In [101]:
changesdict['dollar'] = {}

In [102]:
text = lexiconreplaceassign('dollar',changesdict,dollardict,text)
changesdict['dollar']

{}

In [103]:
edits += '$'

In [104]:
text_S = text

# Replace `▪` Squares

Other 'gremlins' might get added. Check to see if these characters can simply be deleted (easiest), or if they need to be substituted for another character (harder, unless it's always the same character).

In [105]:
re.findall(r'▪',text)

['▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪', '▪']

Depending on the document, there might be a lot of squares, so let's get a bit more context.

In [106]:
square = re.findall(r'\S* \b\S*[▪]\W*\S*\b \S*',text)
square

['The 12th▪ Fourteen Battalions',
 'in Maestricht▪ and our',
 'that over▪looks the',
 'one another▪ A little',
 'and re▪pass the',
 'a Troop▪) The French',
 'by Major▪General Ramsay,',
 'our Camp▪ The Enemy',
 'laying Siege▪to Huy,',
 'they look▪d for,',
 'the Fortification▪ Count Thian',
 'an Invasion▪ There was',
 'of Willebrook▪ and near']

In [107]:
len(square)

13

NB: Remember that you can get more context by extending the regex with more `\S* ` inside the word boundaries (`\b`) on each side

In some documents, this character is rather problematic, since some need deletion and other need spaces, or dashes, or even other letters. If there aren't that many items in `square` (you can return the `len` if you make the results a list), and if you're a stickler for detail, you can do each one separately.

This is where pattern recognition comes in. Maybe you notice that some of the squares should be replaced with a period. Maybe you realize that the following rule will find those where the square is preceded by a lowercase letter and then followed by a space, and then an uppercase letter (note that that uppercase letter is important, otherwise the `Maestricht▪ and` would be converted to a period, like `Maestricht. and`). More problematically, maybe, upon further reflection, you realize that something like `Fortification▪ Count` could be something like `Moving toward the Fortification Count Horn briskly...`. In other words, I'm not sure how to automate this type of error.

One option would be to manually add each errors and corrections. You could simply copy & paste each line, and edit the `old` and `replacement` values in each line. But this would need to be changed for each document.

In [108]:
# if newdoc == 'N':
    # text = addextra('square',changesdict,'12th▪','12th',text)
    # text = addextra('square',changesdict,'Maestricht▪','Maestricht',text)
    # changesdict['square']

Instead, we'll create a separate file with all the square errors, make the appropriate corrections in a text editor, then read them back in as a lexicon to substitute.

So we'll:
1. Write `square` list to a separate file. 
2. Rename that file to `square1` - will avoid overwriting if we need to modify code
2. Ignore any items we don't want to change (quicker than deleting the false positives)
3. Correct those you do want to change
4. Read this `square1` back in as a dict
5. Run the `lexiconreplaceassign` function on this new dict

In [109]:
squaredict = {}
for i in re.findall(r'\S* \b\S*[▪]\W*\S*\b \S*',text):
    squaredict[i] = i
squaredict

{'The 12th▪ Fourteen Battalions': 'The 12th▪ Fourteen Battalions',
 'in Maestricht▪ and our': 'in Maestricht▪ and our',
 'that over▪looks the': 'that over▪looks the',
 'one another▪ A little': 'one another▪ A little',
 'and re▪pass the': 'and re▪pass the',
 'a Troop▪) The French': 'a Troop▪) The French',
 'by Major▪General Ramsay,': 'by Major▪General Ramsay,',
 'our Camp▪ The Enemy': 'our Camp▪ The Enemy',
 'laying Siege▪to Huy,': 'laying Siege▪to Huy,',
 'they look▪d for,': 'they look▪d for,',
 'the Fortification▪ Count Thian': 'the Fortification▪ Count Thian',
 'an Invasion▪ There was': 'an Invasion▪ There was',
 'of Willebrook▪ and near': 'of Willebrook▪ and near'}

In [110]:
with open(outputpath + filename + '_square.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in squaredict.items():
        writer.writerow([k, v])

Create an empty dict

In [111]:
changesdict['square'] = {}

After you've changed the items in text editor, you reset the `Set flag` newdoc to `Y` and run the rest of the notebook. The `if` statement will run the following cell.

In [112]:
'▪' in text

True

In [113]:
with open (outputpath + filename + '_square1.csv','r') as subs:
    reader = csv.reader(subs)
    square1 = {rows[0]:rows[1] for rows in reader}
square1

{'The 12th▪ Fourteen Battalions': 'The 12th Fourteen Battalions',
 'in Maestricht▪ and our': 'in Maestricht and our',
 'that over▪looks the': 'that overlooks the',
 'one another▪ A little': 'one another. A little',
 'and re▪pass the': 'and repass the',
 'a Troop▪) The French': 'a Troop) The French',
 'by Major▪General Ramsay,': 'by Major-General Ramsay,',
 'our Camp▪ The Enemy': 'our Camp. The Enemy',
 'laying Siege▪to Huy,': 'laying Siege to Huy,',
 'they look▪d for,': 'they looked for,',
 'the Fortification▪ Count Thian': 'the Fortification Count Thian',
 'an Invasion▪ There was': 'an Invasion. There was',
 'of Willebrook▪ and near': 'of Willebrook, and near'}

In [114]:
changesdict['square'] = {}

In [115]:
if newdoc == 'N':
    with open (outputpath + filename + '_square1.csv','r') as subs:
        reader = csv.reader(subs)
        square1 = {rows[0]:rows[1] for rows in reader}
    text = lexiconreplaceassign('square',changesdict,square1,text)

In [116]:
changesdict['square']

{'The 12th▪ Fourteen Battalions': 'The 12th Fourteen Battalions',
 'in Maestricht▪ and our': 'in Maestricht and our',
 'that over▪looks the': 'that overlooks the',
 'one another▪ A little': 'one another. A little',
 'and re▪pass the': 'and repass the',
 'a Troop▪) The French': 'a Troop) The French',
 'by Major▪General Ramsay,': 'by Major-General Ramsay,',
 'our Camp▪ The Enemy': 'our Camp. The Enemy',
 'laying Siege▪to Huy,': 'laying Siege to Huy,',
 'they look▪d for,': 'they looked for,',
 'the Fortification▪ Count Thian': 'the Fortification Count Thian',
 'an Invasion▪ There was': 'an Invasion. There was',
 'of Willebrook▪ and near': 'of Willebrook, and near'}

There may be one or two others, given the regex we used above. This should get caught by later code.

In [117]:
re.findall(r'\S*▪\S*',text)

['Salisch▪']

In [118]:
if newdoc == 'N':
    edits += '▪'
    text_s = text

# Delete `•` and `·` circles

Find all circles, with some surrounding context - on the assumption that circles `•` might be as complicated as their similarly-shaped square `▪` cousins.

In [119]:
len(re.findall('[•·]',text))

11

Are there any circles with non-whitespace character around them?

In [120]:
circle = re.findall(r'\S* \b\S* {0,1}[•·] {0,1}\S* \S*\b',text)
circle

['foment H•a•s and',
 'Colonel No•im, who',
 'de No•ailles got',
 'at Tav•ers, upon',
 'Lewenhaupt, Olim• Koningsmark',
 'Major-General Sonsfeldt•; the',
 'have compass•d so',
 'sent Order• to Maestricht',
 'Baggage march•d by',
 'times attack•d by']

In [121]:
len(circle)

10

As feared, here too, we might need to do different things with the circles, requiring separate edits.

## Create `circledict` and change

We'll use the same process as with squares: save as `circles`, rename as `circles1`, make corrections, read back in and substitute.

In [122]:
circledict = {}
for i in re.findall(r'\S* \b\S* {0,1}[•·] {0,1}\S* \S*\b',text):
    circledict[i] = i
circledict

{'foment H•a•s and': 'foment H•a•s and',
 'Colonel No•im, who': 'Colonel No•im, who',
 'de No•ailles got': 'de No•ailles got',
 'at Tav•ers, upon': 'at Tav•ers, upon',
 'Lewenhaupt, Olim• Koningsmark': 'Lewenhaupt, Olim• Koningsmark',
 'Major-General Sonsfeldt•; the': 'Major-General Sonsfeldt•; the',
 'have compass•d so': 'have compass•d so',
 'sent Order• to Maestricht': 'sent Order• to Maestricht',
 'Baggage march•d by': 'Baggage march•d by',
 'times attack•d by': 'times attack•d by'}

In [123]:
with open(outputpath + filename + '_circle.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in circledict.items():
        writer.writerow([k, v])

In [124]:
changesdict['circle'] ={}

In [125]:
if newdoc == 'N':
    with open (outputpath + filename + '_circle1.csv','r') as subs:
        reader = csv.reader(subs)
        circle1 = {rows[0]:rows[1] for rows in reader}
    text = lexiconreplaceassign('circle',changesdict,circle1,text)

In [126]:
changesdict['circle']

{'foment H•a•s and': 'foment Has and',
 'Colonel No•im, who': 'Colonel Noim, who',
 'de No•ailles got': 'de Noailles got',
 'at Tav•ers, upon': 'at Tavers, upon',
 'Lewenhaupt, Olim• Koningsmark': 'Lewenhaupt, Olim Koningsmark',
 'Major-General Sonsfeldt•; the': 'Major-General Sonsfeldt; the',
 'have compass•d so': 'have compassed so',
 'sent Order• to Maestricht': 'sent Order to Maestricht',
 'Baggage march•d by': 'Baggage marched by',
 'times attack•d by': 'times attacked by'}

Any missed by the regex should get caught later on. Or, you could improve the regex.

In [127]:
if newdoc == 'N':
    edits += 'c'
    text_c = text

# Delete angled brackets `〈`

Angled brackets could also be a problem. Here we do something very specific.

In [128]:
changesdict['blankpage'] = {}
text = addextra('blankpage',changesdict,'〈1 page duplicate〉',' ',text)
changesdict['blankpage']

{'〈1 page duplicate〉': ' '}

In [129]:
text_bp = text

# Convert `¶`

It's possible there are paragraph marks within your text.

In [130]:
re.findall(r'\S*¶\S*',text)

[]

In [131]:
paradict = {}
for i in re.findall(r'\S*¶\S*',text):
    paradict[i] = i.replace('¶','')
paradict

{}

In [132]:
changesdict['para'] = {}

In [133]:
text = lexiconreplaceassign('para',changesdict,paradict,text)
changesdict['para']

{}

In [134]:
edits += 'p'
text_p = text

# Other weird (non-ASCII) characters to consider

After we've made targeted corrections, let's see what other non-ASCII characters we need to deal with. Even if we want to keep some accented characters, there are probably others that we should change, because they are most likely OCR/entry errors. For example, my texts are primarily in English and French circa 1700, which means that there are many non-English characters that contemporaries would *not* have used. We can search for them with regex.

Just as an example, to find any foreign `c` characters: `ćĉċč`, excluding the French `ç`:

In [135]:
re.findall(r'\b\S*[ćĉċč]\S*\b',text)

[]

But we won't necessarily know *which* weird characters will be in any given text.

One simple way to find all the non-ASCII characters is to use the `findall` method for all non-ASCII characters.

In [136]:
nonascii = set(re.findall(r'[^\x00-\x7F]',text))
nonascii

{'ç', 'è', 'é', '̄', '—', '▪'}

Note that you could have actually done this earlier in this notebook, and you would have found the divisor, the degree sign, and so on. But you want to keep separate those character errors that can be programmatically corrected (e.g. `Ʋ` always is a `U` - have the code automatically correct them) with character errors that require individual attention. Generally, make specific, targeted, corrections before general braod-brush changes.

Let's assume that the other character errors are unknown, at this stage. Let's find them and read them into a dict with the error in both key and value. We'll then write that dict out to a separate file, and we can just change the single character error, instead of retyping the entire key. Open file, make corrections manually, read back in as lexicon and run to correct.

In [137]:
nonasciidict = {}
for i in re.findall(r'\S* \S*[^\x00-\x7F]\S* \S*',text):
    nonasciidict[i] = i
nonasciidict

{'Ittersum, Warfusé, Hubert,': 'Ittersum, Warfusé, Hubert,',
 'Colonels Dompré, Roo,': 'Colonels Dompré, Roo,',
 'St. André, all': 'St. André, all',
 'St. André, that': 'St. André, that',
 'St. André to': 'St. André to',
 'St. André upon': 'St. André upon',
 'St. André, the': 'St. André, the',
 'St. André and': 'St. André and',
 'St. André, and': 'St. André, and',
 'by Mellé towards': 'by Mellé towards',
 'to Condé, where': 'to Condé, where',
 'and Condé, they': 'and Condé, they',
 'from Condé to': 'from Condé to',
 'St. André, Major-General': 'St. André, Major-General',
 'St. André towards': 'St. André towards',
 'Major-General Cohorné had': 'Major-General Cohorné had',
 'the Fren̄ch begin': 'the Fren̄ch begin'}

In [138]:
len(nonasciidict)

17

Depending on the document, some of these characters may be replaced with a single character. Others, however, may require much more effort, since a single weird character might need to be replaced with many different characters, depending on the instance. The `ï` might be `o`, `i`, `s`, `e`, `u`..., depending on the context!

If you do see a pattern,  e.g. many of the `ï` are `i`, you can do a regex replace, either here or in your text editor, and then manually correct those few that were corrected to `i`, but should actually be something else. You could also check to see if some of the words repeat themselves, e.g. maybe `Generalï` appears multiple times. Generally, try to minimize the amount of time you spend doing hand-correction.

For the time being, we can write this list to a file, that we can look at and decide what to do.

In [139]:
with open(outputpath + filename + '_unicode_to_check_list.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in nonasciidict.items():
            writer.writerow([k, v])

After we have manually corrected the `check_list.csv`, we can change the `newdoc` value to `Y` (at the beginning of the notebook), and then rerun the entire notebook. This will run the code below, which will run the following two cells:

In [140]:
changesdict['unicode'] ={}

In [141]:
if newdoc == 'N':
    with open (outputpath + filename + '_unicode_to_check_list1.csv','r') as subs:
        reader = csv.reader(subs)
        unicode = {rows[0]:rows[1] for rows in reader} # 1st row as key, 2nd row as value
    text = lexiconreplaceassign('unicode',changesdict,unicode,text)

In [142]:
changesdict['unicode']

{'Ittersum, Warfusé, Hubert,': 'Ittersum, Warfusé, Hubert,',
 'Colonels Dompré, Roo,': 'Colonels Dompré, Roo,',
 'St. André, all': 'St. André, all',
 'St. André, that': 'St. André, that',
 'St. André to': 'St. André to',
 'St. André upon': 'St. André upon',
 'St. André, the': 'St. André, the',
 'St. André and': 'St. André and',
 'St. André, and': 'St. André, and',
 'by Mellé towards': 'by Mellé towards',
 'to Condé, where': 'to Condé, where',
 'and Condé, they': 'and Condé, they',
 'from Condé to': 'from Condé to',
 'St. André, Major-General': 'St. André, Major-General',
 'St. André towards': 'St. André towards',
 'Major-General Cohorné had': 'Major-General Cohorne had',
 'the Fren̄ch begin': 'the French begin'}

In [143]:
if newdoc == 'N':
    edits += 'u'
    text_u = text

# Add space between punctuation

Sometimes punctuation might need to be padded;like this. Do apostrophes separately, because they are complicated by things like elision.

In [144]:
re.findall('\S*[a-z][,;][a-z]\S*',text)

[]

In [145]:
puncdict = {}
puncpatt1 = re.compile(r'(\S*[a-z][,;])([a-z]\S*)')
for i in re.findall(r'\S*[a-z][,;][a-z]\S*',text):
    puncdict[i] = re.sub(puncpatt1,r'\1 \2',i)
puncdict

{}

In [146]:
changesdict['puncspace'] = {}

In [147]:
text = lexiconreplaceassign('puncspace',changesdict,puncdict,text)
changesdict['puncspace']

{}

In [148]:
edits += '_'
text__ = text

# Remove duplicate apostrophes

Reduce duplicate apostrophes to a single apostrophe, to be cleaned later.

In [149]:
duplapost = {}
for i in re.findall(r'\b\S*\'{2,} [a-zA-Z]\S*\b',text):
    duplapost[i] = i.replace("''","'")
duplapost

{}

In [150]:
changesdict['duplapost'] = {}

In [151]:
text = lexiconreplaceassign('duplapost',changesdict,duplapost,text)
changesdict['duplapost']

{}

In [152]:
edits += "'"
text_a = text

# Convert double-letter to single letter

Dangerous, given some legit double-letter words

In [153]:
doubleletter = re.findall(r'\b([a-zA-Z])\1{2}',text)
doubleletter

['I']

This could be worked on.

# Convert `vv` to `w`

There are a few double-letters that we want converted to a known single character.

In [154]:
vvdict = {}
for i in re.findall(r'\S* \S*[vV]{2}\S* \S*',text):
    vvdict[i] = re.sub('[vV]{2}','w',i)
    #vvdict[i] = i.replace('VV','W')
vvdict

{'and VVounded, at': 'and wounded, at',
 'and VVounded that': 'and wounded that',
 'and VVounded who': 'and wounded who'}

In [155]:
changesdict['vv'] = {}

In [156]:
text = lexiconreplaceassign('vv',changesdict,vvdict,text)
changesdict['vv']

{'and VVounded, at': 'and wounded, at',
 'and VVounded that': 'and wounded that',
 'and VVounded who': 'and wounded who'}

In [157]:
edits += 'v'
text_vv = text

## `ó`

This letter might be extraneous (delete), but might also actually be an `o` - how distinguish with code?

Another correction to think about before automating.

In [158]:
weirdo = re.findall(r'\b\S*ó\S*',text)
weirdo

[]

In [159]:
weirdodict = {}
for i in re.findall(r'\b\S*ó\S*',text):
    weirdodict[i] = i.replace('ó','o')
weirdodict

{}

In [160]:
changesdict['weirdo'] = {}

In [161]:
text = lexiconreplaceassign('weirdo',changesdict,weirdodict,text)
changesdict['weirdo']

{}

In [162]:
edits += 'o'
text_o = text

# Standardize Dashes and Hyphens

Correcting dashes is a more challenging task. As with the previous punctuation, there are multiple variants of the dash/hyphen. In fact, there are *four* different dash characters (hyphen `-`, minus-sign `−`, em dash `—`, en dash `–`) that are practically indistinguishable to the untrained eye. But OCR/entry might interpret the same character differently from instance to instance, and sometimes printers would use them in different scenarios as well. So depending on your purpose and genre of text, you might want to keep some of the dashes, but delete others. Or, in this case, you just want to convert them all to the same type of dash.

Further complicating matters, there will be a lot of dashes in published texts: compound toponyms like `Saint-Omer` or people like `Jean-Claude`. French and foreign phrases might have correct dashes to keep, `n'est-ce pas?`. 
Summarizing acceptable uses of dashes:
1. Foreign phrases (`est-ce que`, `n'est-ce pas`)
2. Names (`Mérode-Westerloo`, `Newcastle-upon-Tyne`)
3. Year ranges (`1700-1704`)
4. Titles (`aide-de-camp`...)
5. British spellings (e.g. `co-opt`)

Depending on the quality of the OCR/transcription, you might also find random dashes (e.g. `the-only`) that require correction. But you don't want to make a mass update of every dash between two letters, for fear of deleting the acceptable dashes mentioned above.

For early modernists using EEBO/ECCO, we discover that since the Text Creation Partnership was interested in the physical layout of the printed page, as well as the text on it, they faithfully recorded the hyphenated splits of words across lines. You'll find the same on OCRed text. You can use regex to catch simple version like `compound` thus: `com-\npound`, or, more generally, `[a-z]-\n[a-z]` will find every dash and line break surrounded by a lowercase letter on either side. VEP has eliminated those.

Unfortunately, those line breaks may have further complications, particularly if you have OCRed text. As mentioned earlier, tight book bindings can lead to dark shadows in the margin/gutter on either side of the page, which might be interpreted as characters, which could require your regex to find a scenario like this: ` com- i`

`* pound`. 

A word might be split across more than one line break, if, for example, the word breaks across two pages. In that case, you'd need to clear out the header and footer text first, as well as any gutter cruft, before rejoining the split word.

In other words, there are a few issues to address. So you might want to treat them differently, and in steps. We'll do the following:
1. Standardize all the dashes. I'm not interested in dash variations.
2. Then, we can use a substitution lexicon to replace dashes. In some instances, we'll want to simply delete the dash; others might require converting the dash into a space, and some might convert the dash into an underscore if you want to group words together, distinguishing `Saint_Omer` from `Saint Denis`. This underscoring will be done in the next notebook.

### Standardize dashes

Standardize dashes in full text string, so that if you search for a 'dash', you will find everything that looks like a dash, regardless of how it is used. But if your document uses dashes intelligently, skip this step.

First, find all normal (minus sign, or `-`) hyphens, to see the variations.

In [163]:
dash = []
for i in re.findall(r'\b\S* {0,2}- {0,2}\S*\b',text):
    dash.append(i)
set(dash)

{'1660-1737',
 'Artillery-Horses',
 'Battering-Pieces',
 'Bernstorf-Zell',
 'Blood-Royal',
 'Bomb-Ketches',
 'Bread-Waggons',
 'Breast-works',
 'Burgh-Masters',
 'Coal-pits',
 'Colonel-General',
 'Commissary-General',
 'Court-Marshal',
 'Day-time',
 'Draw-Bridge',
 'Drinking-Money',
 'Earth-work',
 'Field-Officer',
 'Field-Pieces',
 'Field-pieces',
 'Fire-Armes',
 'Fitz-Patrick',
 'Fleet-street',
 'Flower-de-luces',
 'Frosty-weather',
 'Great-Britain',
 'Guards-Room',
 'Head-Quarter',
 'Head-dress',
 'Holstein-Beck',
 'Holstein-Norbourg',
 'Holstein-Norburg',
 'Holstein-Ploen',
 'Holstein-Ploens',
 'Horn-work',
 'Horse-Granadiers',
 'Horse-back',
 'Horse-men',
 'Kings-Armes',
 'Liege-Dragoons',
 'Lieutenant-Colonel',
 'Lieutenant-Colonels',
 'Lieutenant-General',
 'Lieutenant-Generals',
 'Life-G',
 'Life-Gua',
 'Life-Guard',
 'Life-Guards',
 'Life-Regiment',
 'Life-guards',
 'Livery-Coats',
 'MAJOR-GENERAL',
 'Major-General',
 'Major-Generals',
 'Mid-night',
 'Mons-Port',
 'Mortar-piec

In [164]:
len(dash)

243

Some of the above hyphens might be valid uses, e.g. ranks and entities, while others could be archaic hyphenations that we'll want to add to our substitution list, e.g. `Public-Houses` to `public houses` (or possibly `public_houses`).

Next, make a list for all non-hyphen words that we should standardize. (In case we worry about dashes interacting with tokenization, we can just use regex to search over the entire `text` string.)

In [165]:
re.findall(r'[−—–]',text) #the 4 types of dashes

['—', '—', '—', '—']

In [166]:
nonhyphen = []
for i in re.findall(r'\b\S*\W*[−—–]\W*\S*\b',text): # escape the hyphen in regex
    nonhyphen.append(i)
nonhyphen

['3\n—\t2', '3\n—de', '1\n—\t1', '1\n—\t1']

It looks like this type of dash is (mostly) used between a line break (`\n`) and a tab (`\t`).

In [167]:
len(nonhyphen)

4

Standardize all dashes to hyphen.

## Create `dashdict` and standardize

Given the variety of options, this is probably just as easy to code in an `if...elif` loop.

In [168]:
pattern = re.compile(r'(\S* {0,1})[–−−]( {0,1}\S*)') # regex pattern to use later in code; () groups into 2 groups
nonhyphendict = {}
for i in nonhyphen:
    if '−' in i:
        nonhyphendict[i] = i.replace('−','-')
    elif '—' in i:
        nonhyphendict[i] = i.replace('—','-')
    elif '–' in i:
        nonhyphendict[i] = i.replace('–','-')
text = re.sub(pattern,r'\1\2',text) # after made change dict, replace pattern with groups 1 & 2 of pattern (i.e. without space)
edits += "-"
nonhyphendict

{'3\n—\t2': '3\n-\t2', '3\n—de': '3\n-de', '1\n—\t1': '1\n-\t1'}

In [169]:
len(nonhyphendict)

3

Checkpoint save

In [170]:
text_d = text

Now that we've standardized the dashes, we can start replacing them more systematically. We'll use a large lexicon derived from Ted Underwood's `DataMunging` work: https://github.com/tedunderwood/DataMunging/tree/master/rulesets. It's one of many `csv` files that are formatted thus: `incorrect string,correct string`. These values are read into a dict, and then used to substitute the incorrect for the correct.

For these types of lexica, there are several complications that require some thought:
1. Whether you pad the entries with spaces (leading/trailing) or not. This is particularly important regarding substrings within other words. For example,`_ ham_` (underscore indicating a space) would not match `I_ate_ham.`, with ham at the end of a sentence. But finding and replacing every instance of `him` with `ham` (without the padded spaces) would also substitute the substring `ham` in `sham` and `hamlet`, resulting in `shim` and `himlet`. Unpadded substitutions might change many more strings than you intend; on the other hand, padded substitutions may not match every instance you want to change, due to surrounding punctuation.
2. The order in which you run your edits, i.e. the order in which entries are sorted in your lexicon. If you don't use padded spaces, consider running more precise (i.e. longer strings) first, e.g. `_hamlet_`, before shorter strings with `ham` in them.
3. Possessives, plurals, etc. If you include trailing spaces, you will not catch plurals and the like. Normally you would deal with this problem by tokenizing your text and then lemmatizing it (e.g. converting all verb forms into the infinitive...). But if your text is really dirty and if you want to create a clean `txt` version, it might have unpredictable results. Another option is to duplicate your lexicon entries in the substitution list and then add the endings. You can even do this using some Python code.
4. Capitalization - do you need a separate capitalized version of each old/new pair (`hello`, `Hello`)? You could make everything lower case (`lower` method), but then you'd lose any capitalized words that you might want to identify later, e.g. proper nouns with named entity recognition. I went with having both lower and title case in my lexica.

There are different ways you can deal with these: wrap the substring inside word boundaries (`\b`), make the larger string substitutions before the smaller ones, etc. Trial and error, and some regex research, is your best guide. Be sure to check out the results before you go on to the next step.

## Substitute `hyphen2concat`

This substitutes any hyphenated words into single words, based off Underwood's original lexicon. Use the separate `hyphen2underscore` dict to retain hyphens but convert them to an underscore.

In [171]:
hyphenreplace = {}
with open (lexicapath + 'hyphen2concat.csv','r') as subs:
    reader = csv.reader(subs)
    hyphenreplace = {rows[0]:rows[1] for rows in reader}
hyphenreplace

{" on't ": ' on it ',
 " On't ": ' on it ',
 '1700-1': '1700-1701',
 '1701-2': '1701-1702',
 '1702-3': '1702-1703',
 '1703-4': '1703-1704',
 '1704-5': '1704-1705',
 '1705-6': '1705-1706',
 '1706-7': '1706-1707',
 '1707-8': '1707-1708',
 '1708-9': '1708-1709',
 '1709-10': '1709-1710',
 '1710-11': '1710-1711',
 '1711-12': '1711-1712',
 'a-bed': 'abed',
 'A-bed': 'abed',
 'a-breast': 'abreast',
 'A-breast': 'abreast',
 'a-field': 'afield',
 'A-field': 'afield',
 'a-foot': 'afoot',
 'A-foot': 'afoot',
 'a-ground': 'aground',
 'A-ground': 'aground',
 'a-head': 'ahead',
 'A-head': 'ahead',
 'a-kin': 'akin',
 'A-kin': 'akin',
 'a-lee': 'alee',
 'A-lee': 'alee',
 'a-long-side': 'alongside',
 'A-long-side': 'alongside',
 'a-longside': 'alongside',
 'A-longside': 'alongside',
 'a-piece': 'apiece',
 'A-piece': 'apiece',
 'a-stern': 'astern',
 'A-stern': 'astern',
 'a-wash': 'awash',
 'A-wash': 'awash',
 'ablebodied': 'able-bodied',
 'Ablebodied': 'able-bodied',
 'above-ground': 'aboveground',
 'A

In [172]:
len(hyphenreplace)

3876

Show all the items from the above dict that are in `text`, i.e. items to be changed. If there are items in this dict that you don't want to (ever) be changed, you can delete them from the lexicon file, and reload the previous code. But then you'll need to think about whether you want to keep multiple versions of each lexicon or not.

In [173]:
hyphendict = {}
for k,v in hyphenreplace.items():
    if k in text:
        hyphendict[k] = v
hyphendict

{'Breast-work': 'breastwork',
 'Breast-works': 'breastworks',
 'Burgh-Master': 'burgomaster',
 'Coal-pit': 'coal-pit',
 'Day-time': 'daytime',
 'Draw-Bridge': 'drawbridge',
 'Earth-work': 'earthwork',
 'Fire-Armes': 'firearms',
 'Frosty-weather': 'frosty weather',
 'Guards-Room': 'guardroom',
 'half-way': 'halfway',
 'Head-dress': 'headdress',
 'Head-Quarter': 'headquarters',
 'Horn-work': 'hornwork',
 'Horse-back': 'horseback',
 'Horse-men': 'horsemen',
 'Mid-night': 'midnight',
 'out-number': 'outnumber',
 'Pistols-worth': "pistol's worth",
 'Publick-Houses': 'public-houses',
 're-mount': 'remount',
 're-pass': 'repass',
 'Rear-guard': 'rearguard',
 'Rear-Guard': 'rearguard',
 'Three-pence': 'threepence',
 'Twelve-month': 'twelvemonth',
 'under-valued': 'undervalued',
 'Wind-mill': 'windmill'}

In [174]:
len(hyphendict)

28

Run the `lexiconreplaceassign` function for hyphens

In [175]:
changesdict['hyphen'] = {}

In [176]:
text = lexiconreplaceassign('hyphen',changesdict,hyphendict,text)
changesdict['hyphen']

{'Breast-work': 'breastwork',
 'Burgh-Master': 'burgomaster',
 'Coal-pit': 'coal-pit',
 'Day-time': 'daytime',
 'Draw-Bridge': 'drawbridge',
 'Earth-work': 'earthwork',
 'Fire-Armes': 'firearms',
 'Frosty-weather': 'frosty weather',
 'Guards-Room': 'guardroom',
 'half-way': 'halfway',
 'Head-dress': 'headdress',
 'Head-Quarter': 'headquarters',
 'Horn-work': 'hornwork',
 'Horse-back': 'horseback',
 'Horse-men': 'horsemen',
 'Mid-night': 'midnight',
 'out-number': 'outnumber',
 'Pistols-worth': "pistol's worth",
 'Publick-Houses': 'public-houses',
 're-mount': 'remount',
 're-pass': 'repass',
 'Rear-guard': 'rearguard',
 'Rear-Guard': 'rearguard',
 'Three-pence': 'threepence',
 'Twelve-month': 'twelvemonth',
 'under-valued': 'undervalued',
 'Wind-mill': 'windmill'}

Check through the results and make sure there aren't any problems.

In [177]:
text.count('Breast-work')

0

In [178]:
edits += 'h'

In [179]:
text_d = text

# Check punctuation more broadly

If we're paranoid, we can look for words with specified punctuation (in square brackets in the regex) inside them.

See apostrophes below, along with elisions.

In [180]:
punc = re.findall(r'\b\S*[a-zA-Z][•!"#$%&()*+,./:;<=>?@[\]^_`{|}~]\S*\b',text)
for i in punc:
    print(i)

If needed, create an empty `changesdict['punc'] = {}` and manually add any unpredictable errors using `addextra`.

# Title case UPPERCASE words

Some of the edits are best done by tokenizing, i.e. breaking down the uninterrupted `text` string into tokens, but only to find the errors - you'll create a changedict based off the tokens, and then make the changes to `text`. Often these can be thought of as words, but depending on the tokenizer, they can also include punctuation, and words can even be split, e.g. the letters after apostrophes like `'s`.

There are many different types of tokenizers, and they are useful for particular tasks. NLTK's default word tokenizer will split text strings at every whitespace *or* punctuation. That means possessives (e.g. `Dean` and `s` from `Dean's`) and contractions (`can` and `t` from `can't`) will be separated. This may be useful depending on the type of analysis, but for cleaning text, we'd rather it not tokenize based off apostrophes. 

This means that we can clean our text either based off of the text string, or off of the tokens, or both.

Occasinally there will be words in uppercase, i.e. all caps. Decide if you want to convet them to Titlecase or not. You can check for words that are in ALL CAPS (greater than length of 2 so avoid `I`, `M.`, ...)

There are a few issues worth thinking about, scenarios in which you might want to keep multiple uppercase letters:
1. M. (Monsieur de ) or MM. for Messieurs.
2. A. (first name abbreviation) or A.M. 
3. Roman numerals (XVIII...).
4. Maybe used for EMPHASIS, which might be something you're interested in.

To identify case, we'll retokenize `text` (the current version) and then check for any tokens that were upper case and longer than two characters. Skim through the resulting list to see if there is anything out of place. 

If you are concerned about the tokenizer acting weird, you shouldn't rely on `words`, but instead should do a search across `text`.

In [181]:
tokens1 = nltk.word_tokenize(text)
tokens1

['The',
 'history',
 'of',
 'the',
 'campagne',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 ',',
 'Anno',
 'Dom',
 '.',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 ',',
 'Edward',
 ',',
 '1660-1737',
 '.',
 'By',
 'EDWARD',
 "D'AUVERGNE",
 ',',
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 ',',
 'in',
 'the',
 'Isle',
 'of',
 'JERSEY',
 ',',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesties',
 'Regiment',
 'of',
 'Scots',
 'Guards',
 '.',
 'LONDON',
 ',',
 'Printed',
 'for',
 'Matt',
 '.',
 'Wotton',
 ',',
 'at',
 'the',
 'Three',
 'Daggers',
 ';',
 'and',
 'John',
 'Newton',
 ',',
 'at',
 'the',
 'Three',
 'Pigeons',
 ',',
 'near',
 'Temple-Barr',
 ',',
 'in',
 'Fleet-street',
 ',',
 '1694',
 '.',
 'Imprimatur',
 ',',
 'Novemb',
 '.',
 '20',
 '.',
 '1694',
 '.',
 'EDWARD',
 'COOKE',
 '.',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'MAJOR-GENERAL',
 'RAMSAY',
 ',',
 'Colonel',
 'of',
 'Their',
 'Majesties',
 'Regiment',


Use list comprehension to only keep non-puncutation.

In [182]:
words1 = [i for i in tokens1 if i not in (',','-',';',':','.','’','&','#','$','!','%','\'','*','•','(',')')]
words1

['The',
 'history',
 'of',
 'the',
 'campagne',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 'Anno',
 'Dom',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 'Edward',
 '1660-1737',
 'By',
 'EDWARD',
 "D'AUVERGNE",
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 'in',
 'the',
 'Isle',
 'of',
 'JERSEY',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesties',
 'Regiment',
 'of',
 'Scots',
 'Guards',
 'LONDON',
 'Printed',
 'for',
 'Matt',
 'Wotton',
 'at',
 'the',
 'Three',
 'Daggers',
 'and',
 'John',
 'Newton',
 'at',
 'the',
 'Three',
 'Pigeons',
 'near',
 'Temple-Barr',
 'in',
 'Fleet-street',
 '1694',
 'Imprimatur',
 'Novemb',
 '20',
 '1694',
 'EDWARD',
 'COOKE',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'MAJOR-GENERAL',
 'RAMSAY',
 'Colonel',
 'of',
 'Their',
 'Majesties',
 'Regiment',
 'of',
 'Scots',
 'Guards',
 'etc',
 'SIR',
 'I',
 'Need',
 'not',
 'make',
 'an',
 'Apology',
 'for',
 'Presenting',
 'the',
 'Account',
 'of',

In [183]:
allcap = []
for i in words1:
    if i.isupper() and len(i) >2:
        allcap.append(i)
allcap

['EDWARD',
 "D'AUVERGNE",
 'JERSEY',
 'LONDON',
 'EDWARD',
 'COOKE',
 'MAJOR-GENERAL',
 'RAMSAY',
 'SIR',
 'KING',
 'SIR',
 'THE',
 'READER',
 'THIS',
 'BRUGES',
 'THE',
 'HISTORY',
 'THE',
 'CAMPAGNE',
 'THE',
 'OUR',
 'RIGHT',
 'WING',
 'FOOT',
 'LEFT',
 'WING',
 'RESERVE',
 'CAVALRY',
 'INFANTRY',
 'CAVALRY',
 'INFANTRY',
 'ARMY',
 'RIGHT',
 'WING',
 'DUKE',
 'HOLSTEIN',
 'RIGHT',
 'WING',
 'MARQUIS',
 'BEDMAR',
 'RIGHT',
 'WING',
 'RIGHT',
 'WING',
 'BODY',
 'FOOT',
 'DUKE',
 'WIRTEMBERG',
 'BODY',
 'FOOT',
 'LOUIS',
 'XIV',
 'III',
 'VII',
 'VIII',
 'XIV',
 'INFANTRY',
 'GHENDT',
 'BRUGES',
 'MALINES',
 'DENDERMOND',
 'OSTEND',
 'DIXMUYDE',
 'DEINSE',
 'DAMME',
 'LEER',
 'AUDENARDE',
 'CAVALRY',
 'BREDA',
 'HAGUE',
 'BOISLEDUC',
 'GHENDT',
 'BRUGES',
 'GERTRUDENBERG',
 'FINIS']

In [184]:
len(allcap)

75

Note that there may be additional errors in the above list. They might be fixed with later code, but you can also make a note of them just in case.

## Roman numerals

There might be Roman numerals (XIV...), so we should figure out a way to exclude those from our substitution, i.e. we want to keep them UPPERCASE. We could identify each Roman numeral and exclude it from the `allcap` list, or we could figure out a pattern to use to identify only Roman numerals. 

Since there are only a few, we'll just `remove` them from the `allcap` list, and check the resulting `len` to make sure no substrings were accidentally matched. But the `remove` method only removes the first item from the list, and we might have more than one instance of the same item, e.g. `XIV` appears twice. We could make a `for` loop to remove each item: 

    for i in allcap:
    if i == 'XIV':
        allcap.remove('XIV')...
        
But, I just happen to have a list of Roman numerals up to 99 (the things you find on the Internet), so we can import that in as a (padded) lexicon and call our usual `lexiconreplaceassign` function.

In [185]:
with open (lexicapath + 'roman_numerals.txt','r') as f:
    roman = f.read().split('\n')
print(roman)

['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX', 'XXI', 'XXII', 'XXIII', 'XXIV', 'XXV', 'XXVI', 'XXVII', 'XXVIII', 'XXIX', 'XXX', 'XXXI', 'XXXII', 'XXXIII', 'XXXIV', 'XXXV', 'XXXVI', 'XXXVII', 'XXXVIII', 'XXXIX', 'XL', 'XLI', 'XLII', 'XLIII', 'XLIV', 'XLV', 'XLVI', 'XLVII', 'XLVIII', 'XLIX', 'L', 'LI', 'LII', 'LIII', 'LIV', 'LV', 'LVI', 'LVII', 'LVIII', 'LIX', 'LX', 'LXI', 'LXII', 'LXIII', 'LXIV', 'LXV', 'LXVI', 'LXVII', 'LXVIII', 'LXIX', 'LXX', 'LXXI', 'LXXII', 'LXXIII', 'LXXIV', 'LXXV', 'LXXVI', 'LXXVII', 'LXXVIII', 'LXXIX', 'LXXX', 'LXXXI', 'LXXXII', 'LXXXIII', 'LXXXIV', 'LXXXV', 'LXXXVI', 'LXXXVII', 'LXXXVIII', 'LXXXIX', 'XC', 'XCI', 'XCII', 'XCIII', 'XCIV', 'XCV', 'XCVI', 'XCVII', 'XCVIII', 'XCIX']


In [186]:
for romnum in roman:
    for cap in allcap:
        if ' ' + romnum + ' ' == ' ' + cap + ' ':
            allcap.remove(cap)
allcap

['EDWARD',
 "D'AUVERGNE",
 'JERSEY',
 'LONDON',
 'EDWARD',
 'COOKE',
 'MAJOR-GENERAL',
 'RAMSAY',
 'SIR',
 'KING',
 'SIR',
 'THE',
 'READER',
 'THIS',
 'BRUGES',
 'THE',
 'HISTORY',
 'THE',
 'CAMPAGNE',
 'THE',
 'OUR',
 'RIGHT',
 'WING',
 'FOOT',
 'LEFT',
 'WING',
 'RESERVE',
 'CAVALRY',
 'INFANTRY',
 'CAVALRY',
 'INFANTRY',
 'ARMY',
 'RIGHT',
 'WING',
 'DUKE',
 'HOLSTEIN',
 'RIGHT',
 'WING',
 'MARQUIS',
 'BEDMAR',
 'RIGHT',
 'WING',
 'RIGHT',
 'WING',
 'BODY',
 'FOOT',
 'DUKE',
 'WIRTEMBERG',
 'BODY',
 'FOOT',
 'LOUIS',
 'XIV',
 'INFANTRY',
 'GHENDT',
 'BRUGES',
 'MALINES',
 'DENDERMOND',
 'OSTEND',
 'DIXMUYDE',
 'DEINSE',
 'DAMME',
 'LEER',
 'AUDENARDE',
 'CAVALRY',
 'BREDA',
 'HAGUE',
 'BOISLEDUC',
 'GHENDT',
 'BRUGES',
 'GERTRUDENBERG',
 'FINIS']

In [187]:
len(allcap)

71

Replace uppercase words with Titlecase words, using `allcap` list. We can't use the `lexiconreplaceassign` function, since we can't change every instance to the same exact replacement, so we'll need to make a bit of code to take the characters for each item and just convert that word to Titlecase.

In [188]:
changesdict['allcap'] = {}

In [189]:
for i in allcap:
    changesdict['allcap'][i] = i.title() # set dict key to i and value to titlecase i
    text = text.replace(i,i.title()) # replace in text
edits += 'c' # take existing value of edits & add c (cap) to it
changesdict['allcap']

{'EDWARD': 'Edward',
 "D'AUVERGNE": "D'Auvergne",
 'JERSEY': 'Jersey',
 'LONDON': 'London',
 'COOKE': 'Cooke',
 'MAJOR-GENERAL': 'Major-General',
 'RAMSAY': 'Ramsay',
 'SIR': 'Sir',
 'KING': 'King',
 'THE': 'The',
 'READER': 'Reader',
 'THIS': 'This',
 'BRUGES': 'Bruges',
 'HISTORY': 'History',
 'CAMPAGNE': 'Campagne',
 'OUR': 'Our',
 'RIGHT': 'Right',
 'WING': 'Wing',
 'FOOT': 'Foot',
 'LEFT': 'Left',
 'RESERVE': 'Reserve',
 'CAVALRY': 'Cavalry',
 'INFANTRY': 'Infantry',
 'ARMY': 'Army',
 'DUKE': 'Duke',
 'HOLSTEIN': 'Holstein',
 'MARQUIS': 'Marquis',
 'BEDMAR': 'Bedmar',
 'BODY': 'Body',
 'WIRTEMBERG': 'Wirtemberg',
 'LOUIS': 'Louis',
 'XIV': 'Xiv',
 'GHENDT': 'Ghendt',
 'MALINES': 'Malines',
 'DENDERMOND': 'Dendermond',
 'OSTEND': 'Ostend',
 'DIXMUYDE': 'Dixmuyde',
 'DEINSE': 'Deinse',
 'DAMME': 'Damme',
 'LEER': 'Leer',
 'AUDENARDE': 'Audenarde',
 'BREDA': 'Breda',
 'HAGUE': 'Hague',
 'BOISLEDUC': 'Boisleduc',
 'GERTRUDENBERG': 'Gertrudenberg',
 'FINIS': 'Finis'}

Now you can double-check for any UPPERCASE that were missed. They should only be strings that didn't meet the `allcap` criteria.

In [190]:
re.findall(r'[A-Z]{2,}',text)

['TO',
 'OF',
 'IN',
 'WE',
 'TH',
 'II',
 'III',
 'IV',
 'VI',
 'VII',
 'VIII',
 'IX',
 'XI']

If you want, you can change those as well, but, for me, they're fine.

In [191]:
text_u = text

# Elisions and stray apostrophes

Another use for apostrophes is to combine together words so they flow better, in French, for example. We want to identify and preserve those. We can find them searching `text` with a regular expression.

But, sometimes, we might need to delete apostophe rather than combine it. So let's start narrow and limit our regex to known elision characters: `l`, `d`, `m`... followed by capitalized words

## Check for possible elisions requiring compression

In [192]:
elisiontocompress = {}
for i in re.findall(r'\b[ldmLDOM]\' [AEIOU]\S*\b',text): # add other letter combos as needed
    elisiontocompress[i] = re.sub("\' ","\'",i)
elisiontocompress

{"d' Auverquerque": "d'Auverquerque",
 "d' Areo": "d'Areo",
 "d' Auverquerque's": "d'Auverquerque's",
 "d' Ardenne": "d'Ardenne",
 "l' Estrang": "l'Estrang",
 "L' Estrang": "L'Estrang",
 "l' Arteloire": "l'Arteloire",
 "d' Anverquerque": "d'Anverquerque",
 "d' Arms": "d'Arms"}

In [193]:
changesdict['elision'] = {}

In [194]:
text = lexiconreplaceassign('elision',changesdict,elisiontocompress,text)
changesdict['elision']

{"d' Auverquerque": "d'Auverquerque",
 "d' Areo": "d'Areo",
 "d' Ardenne": "d'Ardenne",
 "l' Estrang": "l'Estrang",
 "L' Estrang": "L'Estrang",
 "l' Arteloire": "l'Arteloire",
 "d' Anverquerque": "d'Anverquerque",
 "d' Arms": "d'Arms"}

## Check for possible apostrophes requiring deletion

Check for apostrophes with spaces on either side

In [195]:
apostspace = {}
for i in re.findall(r'\b\S* \' [a-zA-Z]\S*\b',text):
    apostspace[i] = re.sub(" \' "," ",i)
apostspace

{}

In [196]:
changesdict['apostspace'] = {}

In [197]:
text = lexiconreplaceassign('apostspace',changesdict,apostspace,text)
changesdict['apostspace']

{}

Check for apostrophes padded left

In [198]:
apostspaceL = {}
for i in re.findall(r'\b\S* \'[a-zA-Z]\S*\b',text):
    apostspaceL[i] = re.sub(" \'"," ",i)
apostspaceL

{"For, 'tis": 'For, tis',
 "yet 'tis": 'yet tis',
 "Winter, 'tis": 'Winter, tis',
 "because 'twas": 'because twas',
 "And 'tis": 'And tis',
 "for 'tis": 'for tis',
 "us, 'twas": 'us, twas',
 "and 'tis": 'and tis',
 "reason, 'tis": 'reason, tis',
 "as 'twas": 'as twas',
 "time 'twas": 'time twas',
 "Battle, 'tis": 'Battle, tis',
 "Honour: 'Tis": 'Honour: Tis',
 "Field. 'Tis": 'Field. Tis',
 "Forage, 'twas": 'Forage, twas',
 "though 'tis": 'though tis',
 "Forage; 'tis": 'Forage; tis',
 "things, 'twas": 'things, twas',
 "as 'tis": 'as tis',
 "where 'tis": 'where tis',
 "Fourth: 'Twas": 'Fourth: Twas',
 "Mons 'tis": 'Mons tis',
 "Fortifications, 'tis": 'Fortifications, tis',
 "reason 'twas": 'reason twas',
 "Army: 'Tis": 'Army: Tis',
 "Enemy, 'tis": 'Enemy, tis',
 "which 'tis": 'which tis',
 "yet 'twas": 'yet twas',
 "which 'twas": 'which twas',
 "conquis, 'twas": 'conquis, twas',
 "Thus 'tis": 'Thus tis',
 "but 'tis": 'but tis',
 "Town. 'Tis": 'Town. Tis',
 "French. 'Twas": 'French. Twas'

In [199]:
changesdict['apostspaceL'] = {}

In [200]:
text = lexiconreplaceassign('apostspaceL',changesdict,apostspaceL,text)
changesdict['apostspaceL']

{"For, 'tis": 'For, tis',
 "yet 'tis": 'yet tis',
 "Winter, 'tis": 'Winter, tis',
 "because 'twas": 'because twas',
 "And 'tis": 'And tis',
 "for 'tis": 'for tis',
 "us, 'twas": 'us, twas',
 "and 'tis": 'and tis',
 "reason, 'tis": 'reason, tis',
 "as 'twas": 'as twas',
 "time 'twas": 'time twas',
 "Battle, 'tis": 'Battle, tis',
 "Honour: 'Tis": 'Honour: Tis',
 "Field. 'Tis": 'Field. Tis',
 "Forage, 'twas": 'Forage, twas',
 "though 'tis": 'though tis',
 "Forage; 'tis": 'Forage; tis',
 "things, 'twas": 'things, twas',
 "as 'tis": 'as tis',
 "where 'tis": 'where tis',
 "Fourth: 'Twas": 'Fourth: Twas',
 "Mons 'tis": 'Mons tis',
 "Fortifications, 'tis": 'Fortifications, tis',
 "reason 'twas": 'reason twas',
 "Army: 'Tis": 'Army: Tis',
 "Enemy, 'tis": 'Enemy, tis',
 "which 'tis": 'which tis',
 "yet 'twas": 'yet twas',
 "which 'twas": 'which twas',
 "conquis, 'twas": 'conquis, twas',
 "Thus 'tis": 'Thus tis',
 "but 'tis": 'but tis',
 "Town. 'Tis": 'Town. Tis',
 "French. 'Twas": 'French. Twas'

Check for apostspace right

In [201]:
re.findall(r'\b\S*\' [a-zA-Z]\S*\b',text)

["Gensd' armes", "d' Harcourt", "D' Witz", "tho' they"]

Here too, the changes could go multiple directions. Let's save yet another list, edit it, reread it back in, and then substitute.

In [202]:
apostspace = {}
for i in re.findall(r'\b\S*\' [a-zA-Z]\S*\b',text):
    apostspace[i] = i
apostspace

{"Gensd' armes": "Gensd' armes",
 "d' Harcourt": "d' Harcourt",
 "D' Witz": "D' Witz",
 "tho' they": "tho' they"}

In [203]:
with open(outputpath + filename + '_apostspace.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in apostspace.items():
        writer.writerow([k, v])

In [204]:
changesdict['apostspaceR'] = {}

In [205]:
if newdoc == 'N':
    with open (outputpath + filename + '_apostspace1.csv','r') as subs:
        reader = csv.reader(subs)
        apostspaceR = {rows[0]:rows[1] for rows in reader} # 1st row as key, 2nd row as value
    text = lexiconreplaceassign('apostspaceR',changesdict,apostspaceR,text)

In [206]:
changesdict['apostspaceR']

{"Gensd' armes": "Gens d'armes",
 "d' Harcourt": "d'Harcourt",
 "D' Witz": "D'Witz",
 "tho' they": 'though they'}

In [207]:
edits += 'e'
text_e = text

## Check remaining hyphens

Display hyphens not fixed already.

In [208]:
dashleft = []
for i in re.findall(r'\b\S*-\W*\S*\b',text):
    dashleft.append(i)
set(dashleft)

{'1660-1737',
 'Artillery-Horses',
 'Battering-Pieces',
 'Bernstorf-Zell',
 'Blood-Royal',
 'Bomb-Ketches',
 'Bread-Waggons',
 'Colonel-General',
 'Commissary-General',
 'Court-Marshal',
 'Drinking-Money',
 'Field-Officer',
 'Field-Pieces',
 'Field-pieces',
 'Fitz-Patrick',
 'Fleet-street',
 'Flower-de-luces',
 'Great-Britain',
 'Holstein-Beck',
 'Holstein-Norbourg',
 'Holstein-Norburg',
 'Holstein-Ploen',
 'Holstein-Ploens',
 'Horse-Granadiers',
 'Kings-Armes',
 'Liege-Dragoons',
 'Lieutenant-Colonel',
 'Lieutenant-Colonels',
 'Lieutenant-General',
 'Lieutenant-Generals',
 'Life-G',
 'Life-Gua',
 'Life-Guard',
 'Life-Guards',
 'Life-Regiment',
 'Life-guards',
 'Livery-Coats',
 'Major-General',
 'Major-Generals',
 'Mons-Port',
 'Mortar-pieces',
 'Neer-Ische',
 'Neptune-like',
 'Nuns-of',
 'Out-Forts',
 'Passage-Barge',
 'Quarter-Master-General',
 'Quarter-Masters',
 'Roman-Catholick',
 'Sentry-boxes',
 'Serjeant-General',
 'Six-pounders',
 'Small-shot',
 'Snap-sacks',
 'State-Major',
 

In [209]:
len(dashleft)

209

Some of these may need to be fixed, but it might be best to fix them in the next notebook, where we will replace the `-` with `_`, so that we can keep the compound words connected together. For example, `snap-sacks` might be a particular type of sack that should be distinguished from other sacks. If that's the case, then linking the `snap` to `sack` can be done with the underscore.

But there are a few that you can clean up manually. We can use another function to do that:

In [210]:
# text = textreplacereplace('hyphen',changesdict,'Nuns-of','Nuns of',text)
# changesdict['hyphen']

Note that when you use the `addextra` function, this is document-specific. But since the function starts with an "if the error is in `text`...", you don't need to worry about an error that occurs in document1 being added to document2's `changesdict`, unless the error actually exists in document2 as well. 

In [211]:
text_h = text

# Syconpat'd replacements

This loads another Underwood substitution lexicon as a dict, which substitutes the common early modern synocopated words with the full version: `doubl'd` into `doubled`.

In [212]:
with open (lexicapath + 'SyncopeRules_caps.csv','r') as subs:
    reader = csv.reader(subs)
    syncope = {rows[0]:rows[1] for rows in reader}
syncope

{"'tis": 'tis',
 "'Tis": 'tis',
 "abandon'd": 'abandoned',
 "Abandon'd": 'abandoned',
 "abas'd": 'abased',
 "Abas'd": 'abased',
 "abash'd": 'abashed',
 "Abash'd": 'abashed',
 "abhor'd": 'abhorred',
 "Abhor'd": 'abhorred',
 "abhorr'd": 'abhorred',
 "Abhorr'd": 'abhorred',
 "abjur'd": 'abjured',
 "Abjur'd": 'abjured',
 "abolish'd": 'abolished',
 "Abolish'd": 'abolished',
 "above-mention'd": 'above-mentioned',
 "Above-mention'd": 'above-mentioned',
 "above-nam'd": 'above-named',
 "Above-nam'd": 'above-named',
 "abovemention'd": 'abovementioned',
 "Abovemention'd": 'abovementioned',
 "abovenam'd": 'abovenamed',
 "Abovenam'd": 'abovenamed',
 "abridg'd": 'abridged',
 "Abridg'd": 'abridged',
 "absolv'd": 'absolved',
 "Absolv'd": 'absolved',
 "absorb'd": 'absorbed',
 "Absorb'd": 'absorbed',
 "abstain'd": 'abstained',
 "Abstain'd": 'abstained',
 "absur'd": 'absurd',
 "Absur'd": 'absurd',
 "abus'd": 'abused',
 "Abus'd": 'abused',
 "accompany'd": 'accompanied',
 "Accompany'd": 'accompanied',
 "ac

In [213]:
len(syncope)

6547

## Create `syncopedict` with changes

Use lexicon function to make replacements.

In [214]:
changesdict['syncope'] = {}

In [215]:
text = lexiconreplaceassign('syncope',changesdict,syncope,text)
changesdict['syncope']

{"abandon'd": 'abandoned',
 "accompany'd": 'accompanied',
 "advanc'd": 'advanced',
 "alledg'd": 'alleged',
 "appear'd": 'appeared',
 "arriv'd": 'arrived',
 "attack'd": 'attacked',
 "believ'd": 'believed',
 "belong'd": 'belonged',
 "bestow'd": 'bestowed',
 "call'd": 'called',
 "camp'd": 'camped',
 "canton'd": 'cantoned',
 "chasten'd": 'chastened',
 "Cloath'd": 'clothed',
 "concern'd": 'concerned',
 "Conquer'd": 'conquered',
 "consider'd": 'considered',
 "Copy'd": 'copied',
 "cover'd": 'covered',
 "decoy'd": 'decoyed',
 "design'd": 'designed',
 "detach'd": 'detached',
 "Detach'd": 'detached',
 " din'd ": ' dined ',
 "dispers'd": 'dispersed',
 "dispos'd": 'disposed',
 "dress'd": 'dressed',
 "eas'd": 'eased',
 "expos'd": 'exposed',
 "express'd": 'expressed',
 "fatigu'd": 'fatigued',
 "fill'd": 'filled',
 "fir'd": 'fired',
 "flank'd": 'flanked',
 "flush'd": 'flushed',
 "follow'd": 'followed',
 "forc'd": 'forced',
 "form'd": 'formed',
 "gain'd": 'gained',
 "gras'd": 'grassed',
 "guess'd": 'g

How many sycnopat'd words fixed?

In [216]:
len(changesdict['syncope'])

99

In [217]:
edits += 's'
text_s = text

# Hyphenated words separated by line break(s)

With OCRed texts, words might be hyphenated at the end of a line. These need to be rejoined. Note that when you rejoin words, you might create words which then need some further correction. For example, rejoining `garr- ifon` will turn it into a long-s word, which in turn needs to be converted from `garrifon` to `garrison`. So you need to rejoin hyphenated words early on in the notebook, or else you'll need to rerun some of your code again. In general, think carefully (and experiment) to see the exact order in which the various corrections need to be executed. This is also why keeping track of the order in the `edits` variable is useful.

In [218]:
re.findall(r'\-',text)

['-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-',
 '-'

In [219]:
re.findall(r'-\n{1,4}[a-z]',text)

[]

In [220]:
re.findall(r'\b\S*-\n{1,4}[a-z]\S*\b',text)

[]

The code for this correction is a bit more complicated, since you need to delete the hyphen and one or more line breaks. If it's only one return, that's straightforward enough. But if there are multiple returns, your code needs to be a bit more flexible. This code does that by using the `group` feature of regex, but it may need some work.

In [221]:
changesdict['split'] = {}

In [222]:
whole = re.compile(r'\b\S*-\n{1,4}[a-z]\S*\b')
withpar = re.compile(r'(\b\S*)(-\n{1,4})([a-z]\S*\b)')

for i in re.findall(whole,text):
    subs = re.sub(withpar,r'\1\3',i)
    changesdict['split'][i] = subs
changesdict['split']

{}

In [223]:
edits += 'h3'
text_h3 = text

# Rejoin hyphenated words

Find hyphenated words

In [224]:
re.findall(r'\S*\- \S*',text)

[]

NB: Just deleting `- ` might be quickest, but there will be a couple of false positives, words that get compressed together which should be separate.

Maybe use `if` second part starts with upper, just delete hyphen and not space? 

In [225]:
for i in re.findall(r'(\S*)(\- )(\S*)',text):
    if i[2].istitle():
        print(i)

In [226]:
hyphennospace = {}
wholehyphen = re.compile(r'\S*\- \S*')
parthyphen = re.compile(r'(\S*)(\- )(\S*)')

In [227]:
for i in re.findall(wholehyphen,text):
    for j in re.findall(r'(\S*)(\- )(\S*)',i):
        if j[2].istitle():
            subs = re.sub(parthyphen,r'\1-\3',i)
            hyphennospace[i] = subs
        else:
            subs = re.sub(parthyphen,r'\1\3',i)
            hyphennospace[i] = subs
hyphennospace

{}

In [228]:
changesdict['hyphens'] = {}

In [229]:
text = lexiconreplaceassign('hyphens',changesdict,hyphennospace,text)
changesdict['hyphens']

{}

In [230]:
len(changesdict['hyphens'])

0

# `CorrectionRules`

Now we can use the heavy lifting work provided by others. Ted Underwood's `correctionrules` lexicon includes 192,000 common errors derived from cleaning tens of thousands of EEBO TCP texts. It is, therefore, an excellent pre-curated set of targeted corrections that we can take advantage of. I've added numerous additions, based off my own corpus, and also delete some.

In [231]:
with open (lexicapath + 'CorrectionRules_final_curated.csv','r') as subs:
    reader = csv.reader(subs)
    correct = {rows[0]:rows[1] for rows in reader} # 1st row as key, 2nd row as value
correct

{" 'twas ": ' it was ',
 " 'Twas ": ' It was ',
 ' (hall ': ' shall ',
 ' aad ': ' and ',
 ' abateth': ' abates',
 ' Abateth': ' abates',
 ' abf ': ' abs ',
 ' abha ': ' abba ',
 ' acbe ': ' ache ',
 ' accl ': ' acct ',
 ' accr ': ' acer ',
 ' acef ': ' aces ',
 ' acel ': ' aces ',
 ' achc ': ' ache ',
 ' achf ': ' achs ',
 ' acif ': ' acis ',
 ' acla ': ' acta ',
 ' acle ': ' acte ',
 ' acled ': ' acted ',
 ' acles ': ' actes ',
 ' acling ': ' acting ',
 ' acons ': ' aeons ',
 ' actf ': ' acts ',
 ' actiou ': ' action ',
 ' adco ': ' adeo ',
 ' adde ': ' add ',
 ' addeth': 'adds',
 ' Addeth': 'Adds',
 ' aded ': ' acted ',
 ' adf ': ' ads ',
 ' ading ': ' acting ',
 ' adio ': ' actio ',
 ' adium ': ' actium ',
 ' adled ': ' acted ',
 ' adles ': ' actes ',
 ' adling ': ' acting ',
 ' adors ': ' actors ',
 ' adte ': ' acte ',
 ' adual ': ' actual ',
 ' adually ': ' actually ',
 ' Adually ': ' Actually ',
 ' aduate ': ' actuate ',
 ' adum ': ' actum ',
 ' afay ': ' assay ',
 ' afce ': ' a

In [232]:
len(correct)

193255

If you want to check the corrections to be made before you run the substitutions, you can do it with the code below. But this will probably take a few minutes, since it has to loop through 192,000 items across the entire text. Uncomment and run it if you're willing to wait.

In [233]:
# for k,v in correct.items():
#     if k in words:
#         print(k,v)

## Create `correctdict` and change

This may take a few minutes to run, depending on your computer, since it checks each of 192,000 items against every word in your `text`.

In [234]:
changesdict['correct'] = {}

In [235]:
text = lexiconreplaceassign('correct',changesdict,correct,text)
changesdict['correct']

{' aad ': ' and ',
 ' aud ': ' and ',
 ' Canon ': ' cannon ',
 ' Cense ': ' cense ',
 ' conld ': ' could ',
 ' Dopf ': ' Dopff ',
 ' Dopfs ': " Dopff's ",
 ' entred': ' entered',
 ' Escorte ': ' escort ',
 ' Furnes': 'Veurnes',
 ' Furr ': ' fur ',
 ' Furrs ': ' furs ',
 ' Garison ': ' garrison ',
 ' Genap ': ' Genappe ',
 ' Gramont ': ' Grammont ',
 ' Habour ': ' harbor ',
 ' hath ': ' has ',
 ' incamp': ' encamp',
 ' leasure ': ' leisure ',
 ' leasurely ': ' leisurely ',
 " Luy d'ore ": " louis d'or ",
 ' Maes ': ' Maas ',
 ' magnifie ': ' magnify ',
 ' Malines ': ' Mechelen ',
 ' my self ': ' myself ',
 ' onely ': ' only ',
 ' raigned ': ' reigned ',
 ' rowes ': ' rows ',
 ' ruine ': ' ruin ',
 ' Salley ': ' sally ',
 ' Scheld ': ' Scheldt ',
 ' Terces ': ' Tercios ',
 ' Tettan ': ' Tettau ',
 ' tis ': 'it is',
 ' Troup ': ' Troop ',
 ' Troups ': ' Troops ',
 ' wearie ': ' weary ',
 'Aeth ': 'Ath ',
 'Albergoti': 'Albergotti',
 'alledge': 'allege',
 'Ameliswert': 'Amelisweerd',
 'Amm

How many errors did the `CorrectionRules` lexicon catch?

In [236]:
len(changesdict['correct'])

299

Pretty effective - standing on the shoulders of giants, and all that.

In [237]:
edits += 'c'
text_c = text

# `Long-s` substitutions (first pass)

Earlier we used code to replace the official long-s character (`ſ`) with an `f`. Now, we will replace all of the long-s `f`s with the `s`. We do this in several steps.

## Long-s ambiguous pairs

One of the problems with converting long-s words is that English has some words that are ambiguous, i.e. without context, it's unclear whether the string `fail` should be the word `fail`, or rather the word `sail`. Ted Underwood's `AmbiguousPairs` lists about 400 possibilities, including various inflections (`fails`,`failed`,`failing`...).

In order to automate this, we can proceed probabilistically, using frequences to make the best guess. To give an example, we can search through a large English corpus and count up the frequencies of `fail` and `sail` with ngrams, i.e. given the words surrounding this instance of `fail`, say, `to fail away`, is it more likely to be `fail` or `sail`? If there are examples of both ngrams, we can either choose the most likely choice, or we can display it to confirm by eye. I already created the frequencies for the long-f words, and saved it as a separate lexicon, which we can load here.

For now, I've just removed the pairs that are ambiguous *given my corpus* from Underwood's long-s list. For example, in my military-political-diplomatic documents, `cafe` is extremely unlikely, whereas `case` was used *a lot*. Similarly, Jesus doesn't `fave`, he `save`(s), to `see` is much more likely than a `fee`, and so on. But another reason to maintain that audit trail.

## Replace long-s with lexicon

Now we can replace the remaining long-s words by using another Underwood lexicon.

In [238]:
with open (lexicapath + 'long_s_subs_caps.csv','r') as subs:
    reader = csv.reader(subs)
    longsedits = {rows[0]:rows[1] for rows in reader}
longsedits

{' cafe ': ' case ',
 ' caft ': ' cast ',
 ' drefs ': ' dress ',
 ' eaft ': ' east ',
 ' fafe ': ' safe ',
 ' faid ': ' said ',
 ' Faid': ' Said',
 ' fave ': ' save ',
 ' Fave': ' Save ',
 ' faved ': ' saved ',
 ' faw ': ' saw ',
 ' fay ': ' say ',
 ' fee ': ' see ',
 ' feeing ': ' seeing ',
 ' feen ': ' seen ',
 ' fent ': ' sent ',
 ' fet ': ' set ',
 ' fide ': ' side ',
 ' fides ': ' sides ',
 ' fo ': ' so ',
 ' fome ': ' some ',
 ' foon ': ' soon ',
 ' moft ': ' most ',
 ' Monf ': ' Monsieur ',
 ' paff ': ' pass ',
 ' reft ': ' rest ',
 ' ufe ': ' use ',
 '-fide': '-side',
 '1ft': '1st',
 '21ft': '21st',
 '2ift': '21st',
 '31ft': '31st',
 '3ift': '31st',
 'abfence': 'absence',
 'abfolute': 'absolute',
 'abfolutely': 'absolutely',
 'abfurd': 'absurb',
 'abfurdity': 'absurbity',
 'abufe': 'abuse',
 'abufed': 'abused',
 'abufes': 'abuses',
 'abufing': 'abusing',
 'acceffories': 'accessories',
 'acceffory': 'accessory',
 'accefs': 'access',
 'accefsed': 'accessed',
 'accefses': 'accesse

In [239]:
len(longsedits)

964

Find long-s errors. The VEP versions have removed all the long-s words.

In [240]:
for old,new in longsedits.items():
    if old in text:
        print(old)

defert
deferted


In [241]:
changesdict['longs'] = {}

In [242]:
text = lexiconreplaceassign('longs',changesdict,longsedits,text)
changesdict['longs']

{'defert': 'desert'}

In [243]:
for k in longsedits.keys():
    if k.lower() in text:
        print(k)

If any hits appear in the above code, you could make it into a list and then run your `lexiconreplaceassign` function on that list. Another way to deal with the problem of titlecase occurrences is to change the underlying lexicon file with Python: duplicate each item in the lexicon and titlecase the duplicate. Alternately, you could use the `title` method in code.

In [244]:
edits += 'f'
text_f = text

# Delete periods from dates

Need to fix if we want to do sentence tokenization, e.g. `On the 18th. of May...`

In [245]:
dateperiod = []
for i in re.findall(r'\S*\b \d{1,2}th\.\s*\S*\b',text):
    dateperiod.append(i)
dateperiod

['the 18th. of',
 'The 24th. the',
 'The 26th. the',
 'The 27th. the',
 'The 28th. the',
 'The 20th. the',
 'The 4th. we',
 'the 4th. and',
 'The 5th. the',
 'The 6th. the',
 'the 8th. the',
 'The 10th. the',
 'The 11th. the',
 'The 11th. in',
 'The 12th. sixteen',
 'The 13th. the',
 'The 15th. Seven',
 'The 16th. the',
 'the 17th. because',
 'The 18th. they',
 'The 19th. a',
 'The 20th. a',
 'The 24th. One',
 'The 25th. we',
 'The 26th. the',
 'The 27th. the',
 'The 28th. in',
 'The 29th. our',
 'The 4th. the',
 'The 7th. Lieutenant-General',
 'The 10th. the',
 'The 11th. One',
 'The 12th. the',
 'the 13th. by',
 'The 16th. Stuarts',
 'The 17th. the',
 'the 9th. fired',
 'The 11th. Major-General',
 'the 9th. but',
 'the 14th. with',
 'The 25th. the',
 'The 26th. the',
 'The 29th. Count',
 'The 7th. the',
 'the 8th. to',
 'The 8th. the',
 'the 9th. to',
 'The 10th. the',
 'The 11th. His',
 'The 9th. the',
 'The 10th. we',
 'The 11th. the',
 'The 12th. the',
 'The 13th. we',
 'The 14th.

In [246]:
len(dateperiod)

59

Replace and check again

In [247]:
for i in dateperiod:
    text = re.sub(r'(\d{1,2})th.','\g<1>th ',text)
re.findall(r'\S*\b \d{1,2}th\.\s*\S*\b',text)

[]

In [248]:
edits += 'p'

In [249]:
text_p = text

In [250]:
text

"The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\n\nBy Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.\n\nLondon, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\n\n\nImprimatur,\n\nNovemb. 20. 1694.\nEdward Cooke.\n[page]\nTo the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.\n\nSir,\nI Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must fi

# Replace double-spaces with single space

In [251]:
re.findall(r'  ',text)

['  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ',
 '  ']

In [252]:
text = text.replace('  ',' ')
text

"The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\n\nBy Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.\n\nLondon, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\n\n\nImprimatur,\n\nNovemb. 20. 1694.\nEdward Cooke.\n[page]\nTo the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.\n\nSir,\nI Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must fi

In [253]:
edits += ' '

In [254]:
text__ = text

# Remove extra line breaks

Unless multiple line breaks have a special meaning in your document, you can collapse them down to a single return.

In [255]:
linebreak = re.findall(r'\n{2,5}', text)
linebreak

['\n\n',
 '\n\n',
 '\n\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',
 '\n\n',

In [256]:
len(linebreak)

189

We probably don't need to add these before and after changes to `changedict`, so we can just use a simple loop:

In [257]:
for i in re.findall(r'\n{2,5}',text):
    text = re.sub(i,r'\n',text)
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur,
Novemb. 20. 1694.
Edward Cooke.
[page]
To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir,
I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults.

We can, however, add the change to `edits`.

In [258]:
edits += 'r'
text_r = text

## Remove all `\n` separating sentences

You might have one or more line breaks between parts of a sentence. Our previous replacement of multiple returns with a single return should have left us with only one line break in a row. Now we can diagnose where they occur, and decide if they need to be removed in order to rejoin a sentence split across a line break.

In [259]:
premature = re.findall(r'\b\w*? {0,1}\n {0,1}[a-z]+?\b',text)
for i in enumerate(premature):
    print(i)

(0, '1\nchevaux')
(1, '1\nmusketeers')
(2, '2\nmusketeers')


Manually change premature line breaks as needed.

In [260]:
#text = addextra()

In [261]:
text_pre = text

# Intermediate save text

Before we go further, we might as well save an intermediate copy of the revised text. Just be sure to name it very clearly.

In [262]:
with open(processedpath + filename + '_' + edits + '_intermediate.txt', 'w', encoding = 'UTF-8') as export:
    export.write(text)

In [263]:
len(text)

205155

# Compare with word dictionary

Now that we have a cleaner version of the document, we should do a broader search for remaining errors. We can tokenize our text and compare the tokens with an English language dictionary.

Make a new list of the tokens in the corrected text, excluding the specified characters (i.e. punctuation).

In [264]:
wordsdirty = word_tokenize(text)
wordsdirty

['The',
 'history',
 'of',
 'the',
 'campaign',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 ',',
 'Anno',
 'Dom',
 '.',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 ',',
 'Edward',
 ',',
 '1660-1737',
 '.',
 'By',
 'Edward',
 "D'Auvergne",
 ',',
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 ',',
 'in',
 'the',
 'Isle',
 'of',
 'Jersey',
 ',',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesty',
 "'s",
 'Regiment',
 'of',
 'Scots',
 'Guards',
 '.',
 'London',
 ',',
 'Printed',
 'for',
 'Matt',
 '.',
 'Wotton',
 ',',
 'at',
 'the',
 'Three',
 'Daggers',
 ';',
 'and',
 'John',
 'Newton',
 ',',
 'at',
 'the',
 'Three',
 'Pigeons',
 ',',
 'near',
 'Temple-Barr',
 ',',
 'in',
 'Fleet-street',
 ',',
 '1694',
 '.',
 'Imprimatur',
 ',',
 'Novemb',
 '.',
 '20',
 '.',
 '1694',
 '.',
 'Edward',
 'Cooke',
 '.',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'Major-General',
 'Ramsay',
 ',',
 'Colonel',
 'of',
 'Their',
 'Majesty',
 "'s",
 'R

In [265]:
len(wordsdirty)

41738

Eliminate all punctuation from `wordsdirty`

In [266]:
tokens2 = [i for i in wordsdirty if i not in (',','-',';',':','.','’','&','#','$','!','%','\'','*','•','(',')')]
tokens2

['The',
 'history',
 'of',
 'the',
 'campaign',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 'Anno',
 'Dom',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 'Edward',
 '1660-1737',
 'By',
 'Edward',
 "D'Auvergne",
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 'in',
 'the',
 'Isle',
 'of',
 'Jersey',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesty',
 "'s",
 'Regiment',
 'of',
 'Scots',
 'Guards',
 'London',
 'Printed',
 'for',
 'Matt',
 'Wotton',
 'at',
 'the',
 'Three',
 'Daggers',
 'and',
 'John',
 'Newton',
 'at',
 'the',
 'Three',
 'Pigeons',
 'near',
 'Temple-Barr',
 'in',
 'Fleet-street',
 '1694',
 'Imprimatur',
 'Novemb',
 '20',
 '1694',
 'Edward',
 'Cooke',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'Major-General',
 'Ramsay',
 'Colonel',
 'of',
 'Their',
 'Majesty',
 "'s",
 'Regiment',
 'of',
 'Scots',
 'Guards',
 'etc',
 'Sir',
 'I',
 'Need',
 'not',
 'make',
 'an',
 'Apology',
 'for',
 'Presenting',
 'the',
 'Accoun

In [267]:
len(tokens2)

37044

Take the `set` of the above list to see how many "types" (i.e. unique tokens) are in the revised token list.

In [268]:
tokens2set = set(tokens2)
tokens2set

{'Andre',
 'however',
 '30',
 'Alfeldt',
 'Tidcomb',
 'Monsit',
 'Condé',
 'Opening',
 'National',
 'low',
 'other',
 'All',
 'Stone',
 'He',
 'mangled',
 'Fortune',
 'viewing',
 'Tergueson',
 'Brewer',
 'thither',
 'Steenkerque',
 'supplied',
 '92',
 'Passage',
 'Taviers',
 'Dunkirk',
 'visit',
 'Impression',
 'Electoral',
 'Hesse',
 'City',
 'encouraged',
 'deserting',
 'Twelve',
 'Staine',
 'beginning',
 'Ham',
 'Brandenburg',
 'newly',
 'Elbow',
 'victorious',
 'Nobility',
 'amuse',
 'reports',
 'ado',
 'believe',
 'Brigade',
 'fall',
 'Incursions',
 'Choice',
 'Cumtich',
 'Earls',
 'before',
 'the',
 'Danish',
 'Forage',
 '31',
 'Clamorous',
 'Tongres',
 'holes',
 'Bounds',
 'managed',
 'Files',
 'III',
 'information',
 'Vermandois',
 'case',
 'prejudicial',
 'execution',
 'authentic',
 'Ferguson',
 'Custom',
 'Interest',
 'Citadels',
 'stript',
 'War',
 'isYour',
 'Bastion',
 'opening',
 'Italien',
 'justly',
 'Debts',
 'Five',
 'Absolute',
 'want',
 'carry',
 'procure',
 'effect

In [269]:
len(tokens2set)

4063

In other words, 4,000 unique words ("types") occurring a total of 37,000 times.

## Read in eMOP English dictionary to check spelling

The next step is to check our corrected text against a list of English words from the period. We'll use the Early Modern OCR Project (EMOP) word list: https://github.com/Early-Modern-OCR/TesseractTraining. It's a relatively small word dictionary of about 121,000 words. Using a large dictionary (you can find English word lists of 300,000+) will make it more difficult to make accurate corrections, because there are lots of obscure (and modern) words that our c.1700 sources were highly unlikely to use. So we'll keep the 'master dictionary' small, and add more words from our specific domain (in my case, military terminology, early modern French and English titles and ranks...) as needed.

Read word list in. It will be used to find 'real' words that don't need to be corrected further, and flag words that we should check.

In [270]:
with open(lexicapath + 'eMOP_en_dict_edited.txt', 'r',encoding='UTF-8') as f:
    emop_en = f.read().split('\n')
emop_en

['a',
 "a's",
 "aa's",
 'aachen',
 'aaliyah',
 "aaliyah's",
 'aardvark',
 "aardvark's",
 'aardvarks',
 'aaron',
 "ab's",
 'abaci',
 'aback',
 'abacus',
 "abacus's",
 'abacuses',
 'abaft',
 'abalone',
 "abalone's",
 'abalones',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 "abandonment's",
 'abandons',
 'abase',
 'abased',
 'abasement',
 "abasement's",
 'abases',
 'abash',
 'abashed',
 'abashedly',
 'abashes',
 'abashing',
 'abashment',
 "abashment's",
 'abasing',
 'abate',
 'abated',
 'abatement',
 "abatement's",
 'abates',
 'abating',
 'abattoir',
 "abattoir's",
 'abattoirs',
 'abbas',
 'abbasid',
 'abbe',
 'abbé',
 "abbe's",
 "abbé's",
 'abbes',
 'abbés',
 'abbess',
 "abbess's",
 'abbesses',
 'abbey',
 "abbey's",
 'abbeys',
 'abbot',
 "abbot's",
 'abbots',
 'abbott',
 "abbott's",
 'abbr',
 'abbrev',
 'abbreviate',
 'abbreviated',
 'abbreviates',
 'abbreviating',
 'abbreviation',
 "abbreviation's",
 'abbreviations',
 'abbrevs',
 'abby',
 "abby's",
 'abc',
 "abc's",
 'abcs'

In [271]:
len(emop_en)

121937

## Find non-`emop_en` tokens in `tokens2set`

With our `emop_en`, we can find tokens that are *not* in the list and decide how we want to deal with them.

In [272]:
notinemop = []
for i in tokens2:
    if i.lower() not in emop_en: # note lowercasing each token to match emop format
        notinemop.append(i)
notinemop

['Anno',
 'Dom',
 '1694',
 'Huy',
 "D'Auvergne",
 '1660-1737',
 "D'Auvergne",
 'M.',
 'A.',
 'St.',
 'Brelade',
 "'s",
 'Wotton',
 'Temple-Barr',
 'Fleet-street',
 '1694',
 'Novemb',
 '20',
 '1694',
 '[',
 ']',
 'Major-General',
 "'s",
 'isimpossible',
 '[',
 ']',
 '[',
 ']',
 'Maastricht',
 'St.',
 'Steenkerque',
 'Landen',
 "'s",
 '[',
 ']',
 'yetit',
 'isYour',
 'Major-General',
 "'s",
 'Dalhousy',
 "'s",
 "D'Auvergne",
 '[',
 ']',
 'Copyer',
 '[',
 ']',
 'Brugge',
 'Novemb',
 '5/15',
 '1694',
 '[',
 '1',
 ']',
 'Anno',
 'Dom',
 '1694',
 "'s",
 'Winter-Quarters',
 'Charleroi',
 'Winter-Quarters',
 '[',
 '2',
 ']',
 'Waldeck',
 'Fleuri',
 '70000',
 'non-plus',
 'Landen',
 '[',
 '3',
 ']',
 'Landen',
 'Geet',
 'completeing',
 'dissentions',
 "'s",
 'suffrages',
 'Colognee',
 "'s",
 'isvery',
 "'s",
 'Maastricht',
 'equaly',
 'States-General',
 '[',
 '4',
 ']',
 'Tiffeny',
 'Camerlings',
 'Ambacht',
 'St.',
 'Friderick',
 'Ingoldsby',
 "'s",
 'Ostend',
 'Brugge',
 '[',
 '5',
 ']',
 "'s

In [273]:
len(notinemop)

3466

In [274]:
len(set(notinemop))

1021

There are almost 1,000 unique tokens not in `emop_en`, but this is as much an issue with the emop dictionary as with the errors in our source. Some of these 'non-words' might be digits and punctuation (brackets especially). Others might be valid proper nouns - personal names, places, organzations, etc. Some might actually be parts of a compound term that will get fixed once `underscore` substitutions are made in the next notebook.

We can get a better sense by looking at the unique tokens:

In [275]:
notinemop1 = sorted(set(notinemop))
notinemop1

["'s",
 '1',
 '10',
 '100',
 '101',
 '102',
 '103',
 '104',
 '107',
 '10th',
 '11',
 '11th',
 '12',
 '120',
 '12000',
 '12th',
 '13',
 '13th',
 '14',
 '14th',
 '15',
 '150',
 '15th',
 '16',
 '164',
 '1660-1737',
 '1667',
 '1670',
 '1684.',
 '1689',
 '1691',
 '1691.',
 '1692',
 '1692.',
 '1694',
 '16th',
 '17',
 '176',
 '17th',
 '18',
 '18th',
 '19',
 '19680',
 '19th',
 '1st',
 '2',
 '20',
 '2040',
 '20th',
 '21',
 '21st',
 '21th',
 '22',
 '226',
 '22d',
 '22nd',
 '22th',
 '23',
 '23th',
 '24',
 '240',
 '24th',
 '25',
 '25th',
 '26',
 '26400',
 '26th',
 '27',
 '27120',
 '27th',
 '28',
 '28th',
 '29',
 '29th',
 '2d',
 '3',
 '30',
 '300',
 '3000',
 '30th',
 '31',
 '31800',
 '31st',
 '31th',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '3d',
 '4',
 '40',
 '400',
 '4000',
 '41',
 '42',
 '43',
 '44',
 '4480',
 '45',
 '46',
 '47',
 '48',
 '49',
 '49100',
 '4th',
 '5',
 '5/15',
 '50',
 '51',
 '51000',
 '52',
 '53',
 '54',
 '5400',
 '55',
 '56',
 '57',
 '58',
 '59',
 '5th',
 '6',
 '

Since we're only looking for questionable words at this point, let's eliminate digits and punctuation from the list, to focus on the words. Note that the titlecase words are alphabetically sorted *before* their lowercase brethren.

In [276]:
notinemopwords = []
for i in notinemop1:
    if i.isalpha():
        notinemopwords.append(i)
notinemopwords

['Aaarschot',
 'Aeth',
 'Aix',
 'Albans',
 'Albergotti',
 'Alefeldt',
 'Aleman',
 'Alfeldt',
 'Alsatia',
 'Alseldt',
 'Altholstein',
 'Amand',
 'Ambacht',
 'Ambackt',
 'Amelisweerd',
 'Ammunion',
 'Andit',
 'André',
 'Angloises',
 'Angoumois',
 'Anhalt',
 'Anjou',
 'Anno',
 'Arbauville',
 'Arco',
 'Arents',
 'Arkennes',
 'Artois',
 'Asfeldt',
 'Ath',
 'Athlone',
 'Aubeleterre',
 'Aubois',
 'Augustines',
 'Ausart',
 'Auspach',
 'Auverquerque',
 'Avelghem',
 'Aylua',
 'Barfus',
 'Battal',
 'Bavarians',
 'Bavechein',
 'Beauvesois',
 'Bedmar',
 'Belcastel',
 'Belfonds',
 'Bellasis',
 'Benedictins',
 'Berenburg',
 'Berghem',
 'Bernickow',
 'Bernikow',
 'Bernsdorf',
 'Bernstorf',
 'Bernstort',
 'Bertillac',
 'Berwick',
 'Bessiere',
 'Bezons',
 'Bickenfeldt',
 'Bieck',
 'Bielke',
 'Bilanders',
 'Birkenfeldt',
 'Bissy',
 'Blesois',
 'Boisleduc',
 'Boncourt',
 'Boncourts',
 'Bonef',
 'Bonmale',
 'Bonne',
 'Borja',
 'Bossu',
 'Bouffler',
 'Boufflers',
 'Bourbonnois',
 'Bourgogne',
 'Brabant',
 '

In [277]:
len(notinemopwords)

729

### Write questionable words to txt file

Let's write a copy of the above to a separate txt file, in case we want to look at it elsewhere. We might, for example, want to add these items to another lexicon, or to the dictionary, and then read it in and run the corrections using that `Set flag` code.

In [278]:
with open(processedpath + filename + '_words_to_check.txt','w',encoding = 'UTF-8') as f:
    for i in notinemopwords:
        f.write("%s\n" % i)

If you want to find the context of a word from the above list. For example, we might be confused that `fortifie` is still in `text`, even though `fortifie` is in the `CorrectionRules` lexicon. So we can explore its context to figure out why:

In [279]:
re.findall(r'\b\S*\W*fortifie\W*\S*\b',text)

['be fortified',
 'had fortified',
 'they fortified',
 'was fortified',
 'is fortified',
 'well fortified',
 'have fortified',
 'to fortifie, and',
 'now fortified']

We now realize that I'd padded `fortifie` on both sides with spaces in the lexicon, in order to not inadvertently replace substrings, like turning `fortified` into `fortifyd`. Yet that padding made the code miss `fortifie,` with a trailing comma instead of a space.

If you want to make changes on text, rather than just tokens, issues of substrings and inflections and capitalization will loom large in your decisions: `fortify`, `fortifies`, `fortified`, `fortifying`, `fortification`, `fortifications`, `Fortify`, `fortification.`....

In case you wanted to get a count of how many times a particular (sub)string appears in another string, instead of creating a list and checking its `len`, you can use the `count` method in the `text` string:

In [280]:
text.count('fortifie')

9

## Correct questionable words manually

We can correct any of the above that seem peculiar, with our `textreplace` function; errors that are likely to reappear in other documents should be added to the appropriate lexicon.

But let's hold off on that for a minute, until we deal with the proverbial elephant in the room.

## Dealing with proper nouns

Skimming through the items listed above, we can see that many of the items are not errors, but rather proper nouns, which are, not surprisingly, absent from a generic English word dictionary.

We'll deal with the messy proper nouns later, i.e. standardize their spelling, so `Anverquerque` is the same as `Auverquerque` is the same as `Overkirk` is the same as `Ouwerkerk`. But for now, let's eliminate the proper nouns from our list to focus on words that are most-likely mispelled and need correction. This is easiest if you already have a list of proper nouns that are likely to appear in your document. Until your subfield creates its own lexica to share, it may take you a while to create your own. You can start by cobbling together lists of people, places and organizations from your own resources. You can also use Python code to find titlecased words in your corpus as candidates, and you can also find lists, like gazetteers, online. Perhaps even a scanned book index or two might help.

Read in proper names lexicon, found online and added to.

In [281]:
with open(lexicapath + 'ProperNames.csv', 'r',encoding='UTF-8') as f:
    proper = f.read().split('\n')
proper

["'s Gravenhage",
 "'s Gravenmoer",
 "'s Hertogenbosch",
 "'T Goy-Houten",
 'A. de Boislisle',
 'Aa',
 'Aalbers',
 'Aalfeldt',
 'Aalst',
 'Aaron',
 'Aarschot',
 'Aarsleff',
 'Aartsbisschop',
 'Aartshertog',
 'Aartshertogin',
 'Abadie',
 'Abbadie',
 'Abbass',
 'Abbe',
 'Abbé',
 'Abbenbroek',
 'Abbeville',
 'Abbie',
 'Abbot',
 'Abbott',
 'Abby',
 'Abdel',
 'Abdi',
 'Abdias',
 'Abdis',
 'Abdul',
 'Abdulah',
 'Abdullah',
 'Abdurachman',
 'Abednego',
 'Abel',
 'Abelin',
 'Abelinus',
 'Abell',
 'Abels',
 'Abendana',
 'Abensberg',
 'Abercrombie',
 'Abercromby',
 'Aberdeen',
 'Åberg',
 'Abernethy',
 'Abiel',
 'Abigail',
 'Abingdon',
 'Ablancourt',
 'Able',
 'Abraham',
 'Abrahams',
 'Abram',
 'Absalom',
 'Absmade',
 'Absolom',
 'Abst',
 'Abu',
 'Abulafia',
 'Abyssinia',
 'Abyssinian',
 'Acarete',
 'Accarias',
 'Acerra',
 'Achaeus',
 'Achel',
 'Achen',
 'Acheson',
 'Achille',
 'Achilles',
 'Achinstein',
 'Achmed',
 'Achmet',
 'Ackerman',
 'Ackroyd',
 'Acosta',
 'Acquaviva',
 'Acreigne',
 'Acropo

In [282]:
len(proper)

21815

We'll now combine our dictionary and proper nouns lists together, and lowercase them. We only do this in code (i.e. not combine the two lexica together in a file) because we might want to consider them separately. It's also a matter of version control: if, for example, we want to add a new person, we'll need to remember to add it to both the dictionary file and to the proper noun file. Over time, that may lead to the various lexica getting out of sync, with one version including recent additions while the other doesn't. Better to keep them as separate lists, and combine them together in your code when needed.

In [283]:
combowords = []
for i in emop_en:
    combowords.append(i.lower())
for i in proper:
    combowords.append(i.lower())
combowords

['a',
 "a's",
 "aa's",
 'aachen',
 'aaliyah',
 "aaliyah's",
 'aardvark',
 "aardvark's",
 'aardvarks',
 'aaron',
 "ab's",
 'abaci',
 'aback',
 'abacus',
 "abacus's",
 'abacuses',
 'abaft',
 'abalone',
 "abalone's",
 'abalones',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 "abandonment's",
 'abandons',
 'abase',
 'abased',
 'abasement',
 "abasement's",
 'abases',
 'abash',
 'abashed',
 'abashedly',
 'abashes',
 'abashing',
 'abashment',
 "abashment's",
 'abasing',
 'abate',
 'abated',
 'abatement',
 "abatement's",
 'abates',
 'abating',
 'abattoir',
 "abattoir's",
 'abattoirs',
 'abbas',
 'abbasid',
 'abbe',
 'abbé',
 "abbe's",
 "abbé's",
 'abbes',
 'abbés',
 'abbess',
 "abbess's",
 'abbesses',
 'abbey',
 "abbey's",
 'abbeys',
 'abbot',
 "abbot's",
 'abbots',
 'abbott',
 "abbott's",
 'abbr',
 'abbrev',
 'abbreviate',
 'abbreviated',
 'abbreviates',
 'abbreviating',
 'abbreviation',
 "abbreviation's",
 'abbreviations',
 'abbrevs',
 'abby',
 "abby's",
 'abc',
 "abc's",
 'abcs'

Notice that we lowercased the above words, because we'll want to lowercase our questionable words. This will allow us to find matches regardless of the capitalization of any specific token, whether the token is at the beginning of a sentence, in the middle, or whether the author used old-fashioned Weird Capitalization in the middle of a sentence.

In [284]:
len(combowords)

143752

Then we can see which tokens in our document are not in this combined list (words only).

In [285]:
notindict = []
for i in notinemopwords:
    if i.lower() not in combowords: #notice the lowercasing for comparative purposes only
        notindict.append(i)
notindict = sorted(notindict)
notindict

['Aaarschot',
 'Aeth',
 'Ammunion',
 'Andit',
 'Angloises',
 'Arbauville',
 'Arents',
 'Arkennes',
 'Aubeleterre',
 'Aubois',
 'Augustines',
 'Ausart',
 'Auspach',
 'Avelghem',
 'Aylua',
 'Battal',
 'Bavechein',
 'Beauvesois',
 'Benedictins',
 'Berenburg',
 'Berghem',
 'Bernickow',
 'Bernikow',
 'Bernsdorf',
 'Bernstorf',
 'Bernstort',
 'Bertillac',
 'Bessiere',
 'Bickenfeldt',
 'Bieck',
 'Bielke',
 'Bilanders',
 'Birkenfeldt',
 'Bissy',
 'Blesois',
 'Boisleduc',
 'Boncourt',
 'Boncourts',
 'Bonef',
 'Bonmale',
 'Bonne',
 'Borja',
 'Bouffler',
 'Brancon',
 'Bressey',
 'Bretinchamp',
 'Brigad',
 'Brusten',
 'Bugey',
 'Buldenbrook',
 'Bulo',
 'Bulow',
 'Burthers',
 'Busca',
 'Bussiere',
 'Cadrieux',
 'Cailus',
 'Camerling',
 'Camerlings',
 'Caneghem',
 'Capol',
 'Carabiniers',
 'Carle',
 'Carles',
 'Castres',
 'Caunon',
 'Cavoye',
 'Chaludes',
 'Chapelle',
 'Charots',
 'Chastelet',
 'Chemin',
 'Chenteran',
 'Chinays',
 'Chiney',
 'Chivois',
 'Choviere',
 'Churchil',
 'Cinquilles',
 'Circ

In [286]:
len(notindict)

505

Depending on the size of the list, we can check it and look for things to fix:
1. Spelling errors to fix
2. Names to add to `ProperNouns`. We may want to hold off on these for the moment, till we figure out a better way to correct them programmatically.
2. Technical terms to add to `emop_en`, or a specialized lexicon if you prefer.

First, to see how many are likely proper nouns:

In [287]:
count = 0
for i in notindict:
    if i.istitle():
        count += 1
print(count)
print(count/len(notindict))

452
0.8950495049504951


In [288]:
changesdict['notindict'] = {}

Notice that we still have a lot of place names, which means they need to be added to your `propernames` list. 

But first, consider how you want to standardize them, and how that will be implemented. For me, I want all spelling variations standardized to the modern name, to make it easier to automatically look up the coordinates (geocode). But I'll also want to underscore compond proper nouns in order to keep their tokens together, e.g. `Saint-Omer` and `Saint Omer` will both become `Saint_Omer`. Those tasks will be taken care of in another notebook, so for now I'll simply add any spelling variations I see in the `notindict` list to the `CorrectionRules` lexicon, and then rerun the whole notebook to update the changes. Other lexica will be dedicated to replacing those standardized names with the underscored version. If your sources are like mine, you'll also need to decide what to do about exonyms, i.e. foreign names for toponyms, like `Boiled Duck` for `Bois-le-Duc` for `s'Hertogenbosch` or `Den Bosch`. These exonyms themselves can vary according to contemporaries' irregular spelling practices, which scholars are trying to solve with fuzzy matching and machine learning. Standardizing all these to a single modern spelling will require making a (semi-)arbitrary decision about which is the "offical" version. Linked Open Data resources increasingly offer you the option of translating from one to another, if you know how to incorporate that into your workflow.

If you want to find the context of a specific word, you can use some simple regex.

In [289]:
re.findall(r'\b\S*\W*Chemin\W*\S*\b',text)

['the Chemin des']

If you need more context, you can add additional `\S* ` on either side. For more power, though, you can convert the `text` into an NLTK `Text` object and get the concordance.

In [290]:
nltkwords = word_tokenize(text)
nltkwords

['The',
 'history',
 'of',
 'the',
 'campaign',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 ',',
 'Anno',
 'Dom',
 '.',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 ',',
 'Edward',
 ',',
 '1660-1737',
 '.',
 'By',
 'Edward',
 "D'Auvergne",
 ',',
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 ',',
 'in',
 'the',
 'Isle',
 'of',
 'Jersey',
 ',',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesty',
 "'s",
 'Regiment',
 'of',
 'Scots',
 'Guards',
 '.',
 'London',
 ',',
 'Printed',
 'for',
 'Matt',
 '.',
 'Wotton',
 ',',
 'at',
 'the',
 'Three',
 'Daggers',
 ';',
 'and',
 'John',
 'Newton',
 ',',
 'at',
 'the',
 'Three',
 'Pigeons',
 ',',
 'near',
 'Temple-Barr',
 ',',
 'in',
 'Fleet-street',
 ',',
 '1694',
 '.',
 'Imprimatur',
 ',',
 'Novemb',
 '.',
 '20',
 '.',
 '1694',
 '.',
 'Edward',
 'Cooke',
 '.',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'Major-General',
 'Ramsay',
 ',',
 'Colonel',
 'of',
 'Their',
 'Majesty',
 "'s",
 'R

In [291]:
nltktext = Text(nltkwords)

In [292]:
nltktext.concordance("Caesar",lines=25)

no matches


## Write `notindict` to `csv`

Let's write a copy of the above revised list to another txt file, in case we want to look at it elsewhere.

In [293]:
with open(lexicapath + filename + '_addtoproper.csv','w') as f:
    for item in notindict:
        f.write("%s\n" % item)

# Doubly doubly entered words

Sometimes we'll find words that appear twice in a row, e.g. `of of`. It's easy for our brains to skip past them, especially if they're stopwords, so let's let the computer find them.

Here we'll use the `enumerate` method on the `tokens` list. We don't want to use the `words` token list, because that has removed certain characters, so any doubled-up words might actually be separated by punctuation. So let's retokenize the revised text.

In [294]:
token_revised = nltk.word_tokenize(text)
token_revised

['The',
 'history',
 'of',
 'the',
 'campaign',
 'in',
 'the',
 'Spanish',
 'Netherlands',
 ',',
 'Anno',
 'Dom',
 '.',
 '1694',
 'with',
 'the',
 'journal',
 'of',
 'the',
 'siege',
 'of',
 'Huy',
 "D'Auvergne",
 ',',
 'Edward',
 ',',
 '1660-1737',
 '.',
 'By',
 'Edward',
 "D'Auvergne",
 ',',
 'M.',
 'A.',
 'Rector',
 'of',
 'St.',
 'Brelade',
 ',',
 'in',
 'the',
 'Isle',
 'of',
 'Jersey',
 ',',
 'and',
 'Chaplain',
 'to',
 'Their',
 'Majesty',
 "'s",
 'Regiment',
 'of',
 'Scots',
 'Guards',
 '.',
 'London',
 ',',
 'Printed',
 'for',
 'Matt',
 '.',
 'Wotton',
 ',',
 'at',
 'the',
 'Three',
 'Daggers',
 ';',
 'and',
 'John',
 'Newton',
 ',',
 'at',
 'the',
 'Three',
 'Pigeons',
 ',',
 'near',
 'Temple-Barr',
 ',',
 'in',
 'Fleet-street',
 ',',
 '1694',
 '.',
 'Imprimatur',
 ',',
 'Novemb',
 '.',
 '20',
 '.',
 '1694',
 '.',
 'Edward',
 'Cooke',
 '.',
 '[',
 'page',
 ']',
 'To',
 'the',
 'Honourable',
 'Major-General',
 'Ramsay',
 ',',
 'Colonel',
 'of',
 'Their',
 'Majesty',
 "'s",
 'R

## Bigram

One way to check for duplicated words is to split the tokens up into bigrams, each word paired with its following neighbor.

In [295]:
bigrams = nltk.bigrams(token_revised)
bigram = []
for bi in bigrams:
    bigram.append(bi)
bigram

[('The', 'history'),
 ('history', 'of'),
 ('of', 'the'),
 ('the', 'campaign'),
 ('campaign', 'in'),
 ('in', 'the'),
 ('the', 'Spanish'),
 ('Spanish', 'Netherlands'),
 ('Netherlands', ','),
 (',', 'Anno'),
 ('Anno', 'Dom'),
 ('Dom', '.'),
 ('.', '1694'),
 ('1694', 'with'),
 ('with', 'the'),
 ('the', 'journal'),
 ('journal', 'of'),
 ('of', 'the'),
 ('the', 'siege'),
 ('siege', 'of'),
 ('of', 'Huy'),
 ('Huy', "D'Auvergne"),
 ("D'Auvergne", ','),
 (',', 'Edward'),
 ('Edward', ','),
 (',', '1660-1737'),
 ('1660-1737', '.'),
 ('.', 'By'),
 ('By', 'Edward'),
 ('Edward', "D'Auvergne"),
 ("D'Auvergne", ','),
 (',', 'M.'),
 ('M.', 'A.'),
 ('A.', 'Rector'),
 ('Rector', 'of'),
 ('of', 'St.'),
 ('St.', 'Brelade'),
 ('Brelade', ','),
 (',', 'in'),
 ('in', 'the'),
 ('the', 'Isle'),
 ('Isle', 'of'),
 ('of', 'Jersey'),
 ('Jersey', ','),
 (',', 'and'),
 ('and', 'Chaplain'),
 ('Chaplain', 'to'),
 ('to', 'Their'),
 ('Their', 'Majesty'),
 ('Majesty', "'s"),
 ("'s", 'Regiment'),
 ('Regiment', 'of'),
 ('of',

In [296]:
len(bigram)

41737

The result is a list of tuples. Now we can loop through each tuple and see if the first and second parts are the same. If they are, save them to another list.

In [297]:
doubly = []
for i in bigram:
    if i[0] == i[1]:
        doubly.append(i)
doubly

[('of', 'of'),
 ('and', 'and'),
 ('Rottembourg', 'Rottembourg'),
 ('Rassent', 'Rassent'),
 ('Cavoye', 'Cavoye'),
 ('Greder', 'Greder'),
 ('Lagny', 'Lagny'),
 ('Dompré', 'Dompré'),
 ('Leveson', 'Leveson'),
 ("O'Farrell", "O'Farrell"),
 ('Schack', 'Schack'),
 ('Boncourt', 'Boncourt'),
 ('the', 'the'),
 ('to', 'to'),
 ('and', 'and'),
 ('and', 'and'),
 ('of', 'of')]

In [298]:
len(doubly)

17

Since some of these are proper nouns, which we'd think might not be as likely to be accidentally duplicated, we should check to see if all of the above are actually in the text:

NB: Not all of the below list items are actually neighbors, given tokenizing issues. `changesdict` will only replace those that are real.

In [299]:
doublystr = []
for i in doubly:
    dubstr = ' '.join(i) #join method converts the bigram tuple elements into str, so we can use regex
    #doublystr.append(re.findall(r'\S*' + dubstr,i))
    #test = re.findall(r'\S*' + dubstr,i)
    if re.search(r'[a-zA-Z]',dubstr):
        doublystr.append(dubstr)
doublystr = sorted(set(doublystr))    
doublystr

['Boncourt Boncourt',
 'Cavoye Cavoye',
 'Dompré Dompré',
 'Greder Greder',
 'Lagny Lagny',
 'Leveson Leveson',
 "O'Farrell O'Farrell",
 'Rassent Rassent',
 'Rottembourg Rottembourg',
 'Schack Schack',
 'and and',
 'of of',
 'the the',
 'to to']

In [300]:
len(doublystr)

14

In [301]:
doublydict = {}
for i in doublystr:
    doublydict[i] = ' ' + i.split()[0] + ' ' #split list item and only keep first element
doublydict

{'Boncourt Boncourt': ' Boncourt ',
 'Cavoye Cavoye': ' Cavoye ',
 'Dompré Dompré': ' Dompré ',
 'Greder Greder': ' Greder ',
 'Lagny Lagny': ' Lagny ',
 'Leveson Leveson': ' Leveson ',
 "O'Farrell O'Farrell": " O'Farrell ",
 'Rassent Rassent': ' Rassent ',
 'Rottembourg Rottembourg': ' Rottembourg ',
 'Schack Schack': ' Schack ',
 'and and': ' and ',
 'of of': ' of ',
 'the the': ' the ',
 'to to': ' to '}

In [302]:
changesdict['doubly'] = {}
text = lexiconreplaceassign('doubly',changesdict,doublydict,text)
changesdict['doubly']

{'and and': ' and ', 'of of': ' of ', 'the the': ' the ', 'to to': ' to '}

In [303]:
edits += '2'
text_2 = text

# Check for other `possiblenonwords`

Read in stopword list and lexica, so only find unexpected words.

We'll write these to a separate file and go through them at our leisure, and then make the substitutions when `newdoc` == `N`.

In [304]:
emop_en

['a',
 "a's",
 "aa's",
 'aachen',
 'aaliyah',
 "aaliyah's",
 'aardvark',
 "aardvark's",
 'aardvarks',
 'aaron',
 "ab's",
 'abaci',
 'aback',
 'abacus',
 "abacus's",
 'abacuses',
 'abaft',
 'abalone',
 "abalone's",
 'abalones',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 "abandonment's",
 'abandons',
 'abase',
 'abased',
 'abasement',
 "abasement's",
 'abases',
 'abash',
 'abashed',
 'abashedly',
 'abashes',
 'abashing',
 'abashment',
 "abashment's",
 'abasing',
 'abate',
 'abated',
 'abatement',
 "abatement's",
 'abates',
 'abating',
 'abattoir',
 "abattoir's",
 'abattoirs',
 'abbas',
 'abbasid',
 'abbe',
 'abbé',
 "abbe's",
 "abbé's",
 'abbes',
 'abbés',
 'abbess',
 "abbess's",
 'abbesses',
 'abbey',
 "abbey's",
 'abbeys',
 'abbot',
 "abbot's",
 'abbots',
 'abbott',
 "abbott's",
 'abbr',
 'abbrev',
 'abbreviate',
 'abbreviated',
 'abbreviates',
 'abbreviating',
 'abbreviation',
 "abbreviation's",
 'abbreviations',
 'abbrevs',
 'abby',
 "abby's",
 'abc',
 "abc's",
 'abcs'

In [305]:
len(emop_en)

121937

In [306]:
with open(lexicapath + 'stopwords_en_edited.txt', 'r',encoding='UTF-8') as f:
    stops = f.read().split('\n')
stops

["'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'t",
 "'tis",
 "'ve",
 'a',
 'aber',
 'able',
 'aboard',
 'about',
 'according',
 'accordingly',
 'across',
 'actually',
 'again',
 'against',
 'al',
 'all',
 'almost',
 'alone',
 'along',
 'alongside',
 'alors',
 'already',
 'als',
 'also',
 'although',
 'always',
 'am',
 'amid',
 'amidst',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'anti',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'ar',
 'are',
 'area',
 'areas',
 "aren't",
 'around',
 'arrayvar',
 'as',
 'aside',
 'at',
 'au',
 'auch',
 'aucuns',
 'auf',
 'aught',
 'aus',
 'aussi',
 'autre',
 'aux',
 'avant',
 'avec',
 'avoir',
 'awfully',
 'b',
 'be',
 'been',
 'beforehand',
 'began',
 'behind',
 'bei',
 'being',
 'beings',
 'believe',
 'below',
 'besides',
 'best',
 'better',
 'big',
 'bin',
 'bis',
 'bist',
 'bon',
 'both',
 'but',
 'by',
 'c',
 'ça',
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'car',
 'case',
 'cases',
 'cause',
 

In [307]:
allwords = list(sorted(set(emop_en + stops + proper)))
allwords1 = []
for i in allwords:
    if i.isalpha():
        allwords1.append(i.lower())
allwords1 = sorted(allwords1)
allwords1

['a',
 'aa',
 'aachen',
 'aalbers',
 'aalfeldt',
 'aaliyah',
 'aalst',
 'aardvark',
 'aardvarks',
 'aaron',
 'aaron',
 'aarschot',
 'aarsleff',
 'aartsbisschop',
 'aartshertog',
 'aartshertogin',
 'abaci',
 'aback',
 'abacus',
 'abacuses',
 'abadie',
 'abaft',
 'abalone',
 'abalones',
 'abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandons',
 'abase',
 'abased',
 'abasement',
 'abases',
 'abash',
 'abashed',
 'abashedly',
 'abashes',
 'abashing',
 'abashment',
 'abasing',
 'abate',
 'abated',
 'abatement',
 'abates',
 'abating',
 'abattoir',
 'abattoirs',
 'abbadie',
 'abbas',
 'abbasid',
 'abbass',
 'abbe',
 'abbe',
 'abbenbroek',
 'abbes',
 'abbess',
 'abbesses',
 'abbeville',
 'abbey',
 'abbeys',
 'abbie',
 'abbot',
 'abbot',
 'abbots',
 'abbott',
 'abbott',
 'abbr',
 'abbrev',
 'abbreviate',
 'abbreviated',
 'abbreviates',
 'abbreviating',
 'abbreviation',
 'abbreviations',
 'abbrevs',
 'abby',
 'abby',
 'abbé',
 'abbé',
 'abbés',
 'abc',
 'abcs',
 'abdel',
 'abdi',
 'a

In [308]:
len(allwords1)

112212

In [309]:
tokenset1 = []
tokenset = list(sorted(set(token_revised)))
for i in tokenset:
    if i.isalpha():
        tokenset1.append(i.lower()) # lower gets rid of cased- duplicates
tokenset1

['a',
 'aaarschot',
 'abbess',
 'abbey',
 'abby',
 'about',
 'absolute',
 'accident',
 'accidents',
 'accomplishments',
 'according',
 'accordingly',
 'account',
 'accounts',
 'accoutrements',
 'acquitted',
 'acre',
 'action',
 'actions',
 'active',
 'administration',
 'admiral',
 'advantage',
 'advantages',
 'advice',
 'aeth',
 'affairs',
 'after',
 'afternoon',
 'aggressor',
 'aid',
 'air',
 'aix',
 'alarm',
 'albans',
 'albergotti',
 'alefeldt',
 'aleman',
 'alfeldt',
 'all',
 'allegiance',
 'alliances',
 'allie',
 'allied',
 'allies',
 'alsatia',
 'alseldt',
 'altholstein',
 'amand',
 'ambacht',
 'ambackt',
 'ambuscade',
 'amelisweerd',
 'amended',
 'ammunion',
 'ammunition',
 'among',
 'an',
 'and',
 'andit',
 'andre',
 'andré',
 'angels',
 'angle',
 'angles',
 'angloises',
 'angoumois',
 'anhalt',
 'animosities',
 'anjou',
 'anno',
 'another',
 'answer',
 'apartments',
 'apology',
 'apostolic',
 'approaches',
 'arbauville',
 'arcade',
 'archbishop',
 'architecture',
 'arco',
 'ar

In [310]:
len(tokenset1)

3770

Use python list magic to find those in list1 not in list2

In [311]:
main_list = list(set(tokenset1) - set(allwords1))
main_list = sorted(main_list)
main_list

['aaarschot',
 'aeth',
 'ammunion',
 'andit',
 'andveurnes',
 'angloises',
 'arbauville',
 'arell',
 'arents',
 'arkennes',
 'asit',
 'aubeleterre',
 'aubois',
 'augustines',
 'ausart',
 'auspach',
 'avelghem',
 'aylua',
 'battal',
 'bavechein',
 'beauvesois',
 'benedictins',
 'berenburg',
 'berghem',
 'bernickow',
 'bernikow',
 'bernsdorf',
 'bernstorf',
 'bernstort',
 'bertillac',
 'bessiere',
 'betweenveurnes',
 'bickenfeldt',
 'bieck',
 'bielke',
 'bilanders',
 'birkenfeldt',
 'bissy',
 'blesois',
 'boisleduc',
 'boncourt',
 'boncourts',
 'bonef',
 'bonmale',
 'bonne',
 'borja',
 'bouffler',
 'brancon',
 'bressey',
 'bretinchamp',
 'brigad',
 'brusten',
 'bugey',
 'buldenbrook',
 'bulo',
 'bulow',
 'burthers',
 'busca',
 'bussiere',
 'butit',
 'cadrieux',
 'cailus',
 'camerling',
 'camerlings',
 'caneghem',
 'capol',
 'carabiniers',
 'carle',
 'carles',
 'castres',
 'caunon',
 'cavoye',
 'chaludes',
 'chapelle',
 'charots',
 'chastelet',
 'chemin',
 'chenteran',
 'chinays',
 'chine

In [312]:
len(main_list)

499

In [313]:
main_list_title = [i.title() for i in main_list]
main_list_title

['Aaarschot',
 'Aeth',
 'Ammunion',
 'Andit',
 'Andveurnes',
 'Angloises',
 'Arbauville',
 'Arell',
 'Arents',
 'Arkennes',
 'Asit',
 'Aubeleterre',
 'Aubois',
 'Augustines',
 'Ausart',
 'Auspach',
 'Avelghem',
 'Aylua',
 'Battal',
 'Bavechein',
 'Beauvesois',
 'Benedictins',
 'Berenburg',
 'Berghem',
 'Bernickow',
 'Bernikow',
 'Bernsdorf',
 'Bernstorf',
 'Bernstort',
 'Bertillac',
 'Bessiere',
 'Betweenveurnes',
 'Bickenfeldt',
 'Bieck',
 'Bielke',
 'Bilanders',
 'Birkenfeldt',
 'Bissy',
 'Blesois',
 'Boisleduc',
 'Boncourt',
 'Boncourts',
 'Bonef',
 'Bonmale',
 'Bonne',
 'Borja',
 'Bouffler',
 'Brancon',
 'Bressey',
 'Bretinchamp',
 'Brigad',
 'Brusten',
 'Bugey',
 'Buldenbrook',
 'Bulo',
 'Bulow',
 'Burthers',
 'Busca',
 'Bussiere',
 'Butit',
 'Cadrieux',
 'Cailus',
 'Camerling',
 'Camerlings',
 'Caneghem',
 'Capol',
 'Carabiniers',
 'Carle',
 'Carles',
 'Castres',
 'Caunon',
 'Cavoye',
 'Chaludes',
 'Chapelle',
 'Charots',
 'Chastelet',
 'Chemin',
 'Chenteran',
 'Chinays',
 'Chine

In [314]:
len(main_list_title)

499

In [315]:
possiblenonwords = list(set(main_list)-set(proper))
possiblenonwords = sorted(possiblenonwords)
possiblenonwords

['aaarschot',
 'aeth',
 'ammunion',
 'andit',
 'andveurnes',
 'angloises',
 'arbauville',
 'arell',
 'arents',
 'arkennes',
 'asit',
 'aubeleterre',
 'aubois',
 'augustines',
 'ausart',
 'auspach',
 'avelghem',
 'aylua',
 'battal',
 'bavechein',
 'beauvesois',
 'benedictins',
 'berenburg',
 'berghem',
 'bernickow',
 'bernikow',
 'bernsdorf',
 'bernstorf',
 'bernstort',
 'bertillac',
 'bessiere',
 'betweenveurnes',
 'bickenfeldt',
 'bieck',
 'bielke',
 'bilanders',
 'birkenfeldt',
 'bissy',
 'blesois',
 'boisleduc',
 'boncourt',
 'boncourts',
 'bonef',
 'bonmale',
 'bonne',
 'borja',
 'bouffler',
 'brancon',
 'bressey',
 'bretinchamp',
 'brigad',
 'brusten',
 'bugey',
 'buldenbrook',
 'bulo',
 'bulow',
 'burthers',
 'busca',
 'bussiere',
 'butit',
 'cadrieux',
 'cailus',
 'camerling',
 'camerlings',
 'caneghem',
 'capol',
 'carabiniers',
 'carle',
 'carles',
 'castres',
 'caunon',
 'cavoye',
 'chaludes',
 'chapelle',
 'charots',
 'chastelet',
 'chemin',
 'chenteran',
 'chinays',
 'chine

In [316]:
len(possiblenonwords)

499

Normally, we'd make the changes to `possiblenonwords1` and rerun this notebook with the flag set to `newdoc == 'N'`. But this list might have hundreds of possibilities to go through, many of which might be proper nouns that should be added to another lexicon. So for now, we'll just leave it commented out for now, and, after you've gone through the list, you can uncomment this cell and rerun the notebook.

In [317]:
with open(outputpath + filename + '_possiblenonwords.txt', 'w', encoding = 'UTF-8') as f:
    for item in possiblenonwords:
        f.write("%s\n" % item)

In [318]:
# if newdoc == 'N':
#     with open (outputpath + filename + '_possiblenonwords1.txt','r') as subs:
#             possnonword = subs.read().split('\n')
#     changesdict['possnonword'] ={}
#     text = lexiconreplaceassign('possnonword',changesdict,possnonword,text)
#     changesdict['possnonword']

In [319]:
if newdoc == 'N':
    edits += 'p'
    text_p = text

# Check for non-normal characters

Look through text for additional problematic characters.

These likely require manual correction, so create a separate `to_check` list and go through them manually at end

NB: If the document has an Errata list, EEBO TCP likely changed those, so you should check, and then delete the Errata before reading it in as `text`.

In [320]:
nonnorms = re.findall(r'\S*[^\sa-zA-Z0-9,\.\?;:\'\"\(\)\-!\áàäéèëîïíöü]\S*',text)
nonnorms

['[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '5/15.',
 '[1]',
 '[2]',
 '[3]',
 '[4]',
 '[5]',
 '[6]',
 '[',
 '8.]',
 '[7]',
 '[',
 'May.]',
 '[8]',
 '[9]',
 '[',
 'June]',
 '[10]',
 '[11]',
 '[12]',
 'Françoises',
 '[13]',
 '[14]',
 'Fimarçon',
 '[15]',
 '[16]',
 '[17]',
 '[18]',
 '[19]',
 '—',
 '—de',
 '—',
 '[20]',
 '—',
 '[21]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[26]',
 '[27]',
 '[28]',
 '[29]',
 '[30]',
 '[',
 'July.]',
 '[31]',
 '[32]',
 '[33]',
 '[34]',
 '[',
 'Year.]',
 '[35]',
 'Valençar',
 '[36]',
 '[37]',
 '[38]',
 '[39]',
 '[40]',
 'Salisch▪',
 '[41]',
 '[42]',
 '[43]',
 '[44]',
 '[45]',
 '[46]',
 '[47]',
 '[48]',
 '[49]',
 '[50]',
 '[',
 'August.]',
 '[51]',
 '[52]',
 '[53]',
 '[54]',
 '[55]',
 '[56]',
 '[57]',
 'châtellenie',
 '[58]',
 'châtellenie',
 '[59]',
 '[60]',
 '[61]',
 '[62]',
 '[63]',
 '[64]',
 '[65]',
 '[66]',
 '[67]',
 '[68]',
 '[69]',
 '[70]',
 '[71]',
 '[72]',
 '[73]',
 'châtellenie',
 '[74]',
 '[7

In [321]:
len(nonnorms)

147

NB: Padding (e.g. `\S* \S* ` surrounding) will eliminate a few results that aren't surrounded by spaces

In [322]:
to_check = []

In [323]:
for i in nonnorms:
    to_check.append(i)
to_check

['[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '5/15.',
 '[1]',
 '[2]',
 '[3]',
 '[4]',
 '[5]',
 '[6]',
 '[',
 '8.]',
 '[7]',
 '[',
 'May.]',
 '[8]',
 '[9]',
 '[',
 'June]',
 '[10]',
 '[11]',
 '[12]',
 'Françoises',
 '[13]',
 '[14]',
 'Fimarçon',
 '[15]',
 '[16]',
 '[17]',
 '[18]',
 '[19]',
 '—',
 '—de',
 '—',
 '[20]',
 '—',
 '[21]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[26]',
 '[27]',
 '[28]',
 '[29]',
 '[30]',
 '[',
 'July.]',
 '[31]',
 '[32]',
 '[33]',
 '[34]',
 '[',
 'Year.]',
 '[35]',
 'Valençar',
 '[36]',
 '[37]',
 '[38]',
 '[39]',
 '[40]',
 'Salisch▪',
 '[41]',
 '[42]',
 '[43]',
 '[44]',
 '[45]',
 '[46]',
 '[47]',
 '[48]',
 '[49]',
 '[50]',
 '[',
 'August.]',
 '[51]',
 '[52]',
 '[53]',
 '[54]',
 '[55]',
 '[56]',
 '[57]',
 'châtellenie',
 '[58]',
 'châtellenie',
 '[59]',
 '[60]',
 '[61]',
 '[62]',
 '[63]',
 '[64]',
 '[65]',
 '[66]',
 '[67]',
 '[68]',
 '[69]',
 '[70]',
 '[71]',
 '[72]',
 '[73]',
 'châtellenie',
 '[74]',
 '[7

In [324]:
len(to_check)

147

Save `to_check` to file

In [325]:
with open(outputpath + filename + '_char_to_check_nb1.txt', 'w', encoding = 'UTF-8') as export:
    for i in to_check:
        export.write(i + '\n')

## Fix non-normal characters, add to `changesdict`

In [326]:
with open(outputpath + filename + '_to_check_list.csv', 'w', encoding = 'UTF-8') as f:
    for item in to_check:
        f.write(item + ',' + item + '\n')

You should look at the `to_check_list` file in the outputpath. This will have words that possibly need fixing. Most likely, they won't be obvious replacements that can be programmatically, so you should save it as `to_check_list1`, add the replacements to the `.txt` list, and then read it back into the code as one-more lexicon replacement dict. After you've done that, you can comment out the `blah` if you need to do more tweaking of the code. Note that the next cell, loading in `to_check_list1`, will error if you haven't done this already.

Read edited `to_check_list` back in as dict, and delete those dict entries without values

In [327]:
changesdict['nonnorms'] = {}

In [328]:
if newdoc == 'N':
    with open (outputpath + filename + '_to_check_list1.csv','r') as subs:
        reader = csv.reader(subs)
        nonnormsdict = {rows[0]:rows[1] for rows in reader} # 1st row as key, 2nd row as value
    text = lexiconreplaceassign('nonnorms',changesdict,nonnormsdict,text)

In [329]:
changesdict['nonnorms']

{'[page]': '[page]',
 '5/15.': '5/15.',
 '[1]': '[1]',
 '[2]': '[2]',
 '[3]': '[3]',
 '[4]': '[4]',
 '[5]': '[5]',
 '[6]': '[6]',
 '[': '[',
 '8.]': '8.]',
 '[7]': '[7]',
 'May.]': 'May.]',
 '[8]': '[8]',
 '[9]': '[9]',
 'June]': 'June]',
 '[10]': '[10]',
 '[11]': '[11]',
 '[12]': '[12]',
 'Françoises': 'Françoises',
 '[13]': '[13]',
 '[14]': '[14]',
 'Fimarçon': 'Fimarçon',
 '[15]': '[15]',
 '[16]': '[16]',
 '[17]': '[17]',
 '[18]': '[18]',
 '[19]': '[19]',
 '—': '—',
 '—de': '—de',
 '[20]': '[20]',
 '[21]': '[21]',
 '[22]': '[22]',
 '[23]': '[23]',
 '[24]': '[24]',
 '[25]': '[25]',
 '[26]': '[26]',
 '[27]': '[27]',
 '[28]': '[28]',
 '[29]': '[29]',
 '[30]': '[30]',
 'July.]': 'July.]',
 '[31]': '[31]',
 '[32]': '[32]',
 '[33]': '[33]',
 '[34]': '[34]',
 'Year.]': 'Year.]',
 '[35]': '[35]',
 'Valençar': 'Valençar',
 '[36]': '[36]',
 '[37]': '[37]',
 '[38]': '[38]',
 '[39]': '[39]',
 '[40]': '[40]',
 'Salisch▪': 'Salisch▪',
 '[41]': '[41]',
 '[42]': '[42]',
 '[43]': '[43]',
 '[44]': '[

In [330]:
if newdoc == 'N':
    edits += 'n'
    text_n = text

Recreate `to_check_again` after corrections made above

In [331]:
to_check_again = re.findall(r'\S*[^\sa-zA-Z0-9,\.\?;:\'\"\(\)\-!\áàäéèëîïíöü]\S*',text)
to_check_again

['[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '[page]',
 '5/15.',
 '[1]',
 '[2]',
 '[3]',
 '[4]',
 '[5]',
 '[6]',
 '[',
 '8.]',
 '[7]',
 '[',
 'May.]',
 '[8]',
 '[9]',
 '[',
 'June]',
 '[10]',
 '[11]',
 '[12]',
 'Françoises',
 '[13]',
 '[14]',
 'Fimarçon',
 '[15]',
 '[16]',
 '[17]',
 '[18]',
 '[19]',
 '—',
 '—de',
 '—',
 '[20]',
 '—',
 '[21]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[26]',
 '[27]',
 '[28]',
 '[29]',
 '[30]',
 '[',
 'July.]',
 '[31]',
 '[32]',
 '[33]',
 '[34]',
 '[',
 'Year.]',
 '[35]',
 'Valençar',
 '[36]',
 '[37]',
 '[38]',
 '[39]',
 '[40]',
 'Salisch▪',
 '[41]',
 '[42]',
 '[43]',
 '[44]',
 '[45]',
 '[46]',
 '[47]',
 '[48]',
 '[49]',
 '[50]',
 '[',
 'August.]',
 '[51]',
 '[52]',
 '[53]',
 '[54]',
 '[55]',
 '[56]',
 '[57]',
 'châtellenie',
 '[58]',
 'châtellenie',
 '[59]',
 '[60]',
 '[61]',
 '[62]',
 '[63]',
 '[64]',
 '[65]',
 '[66]',
 '[67]',
 '[68]',
 '[69]',
 '[70]',
 '[71]',
 '[72]',
 '[73]',
 'châtellenie',
 '[74]',
 '[7

In [332]:
with open(outputpath + filename + '_char_to_check_nb1.txt', 'w', encoding = 'UTF-8') as export:
    for i in to_check_again:
        export.write(i + '\n')

# Write `_page` version of text

Generally, if you are analyzing an entire text, or if your analysis doesn't depend on pagination, you could make your text one long string. Otherwise, for example, page numbers might be included when you are searching for numbers used by the author in the text. You can always make another copy of the text with the pagination, as backup.

In [333]:
with open(outputpath + filename + '_late_pages.txt', 'w') as f:
    f.write(text)

# Delete line breaks `\n`

If you want to delete all of the extra line breaks, you can use a `list comprehension` to save each line as an item in a list. You then `join` them back together at the very end, back into a string, with a space (or whatever delimiter you want) in between each item.

NB: 
1. This will eliminate any page information. If you want to analyze anything by page, skip this step.
2. If you also want to get rid of paragraph breaks, you can `split` on only a single line break, rather than two.
3. If your text has words that were separated (i.e. hyphenated) across a page break, this will get them closer together, but you'll need to rerun the hyphenation correction. And you might even need to rerun the other corrections if, for example, that hyphenated word split across two pages was spelled wrong, e.g.

`Anver-`

`querque` should really be `Anverquerque`, which should really be `Auverquerque`, which, frankly, should really be `Overkirk` - but that's for another notebook.

In [334]:
text_lines = [line for line in text.split('\n\n') if line.strip()]
text_lines

["The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\nBy Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.\nLondon, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\nImprimatur,\nNovemb. 20. 1694.\nEdward Cooke.\n[page]\nTo the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.\nSir,\nI Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great 

NB: I named this version `text2`, in case you want to retain both it and `text` in your code.

In [335]:
text2 = ' '.join(text_lines)
print(text2)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur,
Novemb. 20. 1694.
Edward Cooke.
[page]
To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir,
I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults.

In [336]:
edits = edits + 'l'

In [337]:
text

"The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy\nD'Auvergne, Edward, 1660-1737.\nBy Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.\nLondon, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.\nImprimatur,\nNovemb. 20. 1694.\nEdward Cooke.\n[page]\nTo the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.\nSir,\nI Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great m

# Delete page numbers `[\d]`

This step is also optional, depending on your purposes, and whether you are focused only on the content written by the author, vs. being interested in the layout of the content on the printed page.

NB: If you want to analyze your text by sentence, consider getting rid of the page numbers, since many sentences are split up by line breaks and a page number. This might also be a concern if there is a word that is hyphenated at the bottom of a page.

In [338]:
pagedict = {}
page = re.findall(r'\n\[\d{1,4}\]\n',text)
for i in page:
    pagedict[i] = ''
    print(i)


[1]


[2]


[3]


[4]


[5]


[6]


[7]


[8]


[9]


[10]


[11]


[12]


[13]


[14]


[15]


[16]


[17]


[18]


[19]


[20]


[21]


[22]


[23]


[24]


[25]


[22]


[23]


[24]


[25]


[26]


[27]


[28]


[29]


[30]


[31]


[32]


[33]


[34]


[35]


[36]


[37]


[38]


[39]


[40]


[41]


[42]


[43]


[44]


[45]


[46]


[47]


[48]


[49]


[50]


[51]


[52]


[53]


[54]


[55]


[56]


[57]


[58]


[59]


[60]


[61]


[62]


[63]


[64]


[65]


[66]


[67]


[68]


[69]


[70]


[71]


[72]


[73]


[74]


[75]


[76]


[77]


[78]


[79]


[80]


[81]


[82]


[83]


[84]


[85]


[86]


[87]


[88]


[89]


[90]


[91]


[92]


[93]


[94]


[95]


[96]


[97]


[98]


[99]


[100]


[101]


[102]


[103]


[104]



In [339]:
text = re.sub(r'\n\[\d{1,4}\]\n',' ',text)
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur,
Novemb. 20. 1694.
Edward Cooke.
[page]
To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir,
I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults.

## Delete `[page]`

In [340]:
pagedict = {}
page = re.findall(r'\[page\]',text)
for i in page:
    pagedict[i] = ''
    print(page.index(i),i)

0 [page]
0 [page]
0 [page]
0 [page]
0 [page]
0 [page]


In [341]:
text = re.sub(r'\[page\]','',text)
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur,
Novemb. 20. 1694.
Edward Cooke.

To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir,
I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults. For,i

## Delete `Page #`

In [342]:
page = re.findall(r'\n[pP]age \d{1,4}\n{1,2}',text)
for i in page:
    pagedict[i] = ''
    print(i)

In [343]:
text = re.sub(r'\n[pP]age \d{1,4}\n{1,2}',' ',text)
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy
D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur,
Novemb. 20. 1694.
Edward Cooke.

To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir,
I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults. For,i

In [344]:
edits += 'p'
text_pg = text

# Delete extra line breaks

Sometimes line breaks split up sentences. Also an issue if word needing correcting is at beginning of line and it requires a padded space in lexicon!

In [345]:
re.findall(r'\S*[^\s\.\?]\n\S*',text)

["Huy\nD'Auvergne,",
 'Imprimatur,\nNovemb.',
 'Sir,\nI',
 'so\n',
 'than\n',
 'His\n',
 'of\nSir,',
 'the\n',
 'Wing\nFirst',
 'Lieutenant-Generals,\nDuc',
 'Bourbon,\nMonsieur',
 'Major-Generals,\nDuc',
 "d'Elbeuf,\nDuc",
 '2\nNoailles',
 '2\nDuras',
 '2\nLuxembourg',
 '2\nLorges',
 '2\nGens',
 '1\nchevaux',
 '1\nMontgon',
 '3\nBourbon',
 '2\nLa',
 '2\nVillequier',
 '2\nRottembourg',
 '3\nRoquespine',
 '3\nRohan',
 '2\nPhelipeaux',
 '2\nDauphin',
 '3\nCravates',
 '3\n',
 '37\nSecond',
 '3\nLa',
 '3\nLevis',
 '3\nLa',
 '3\nRassent',
 '3\nManderscheid',
 '3\nVaillac',
 '3\nLa',
 '3\nImécourt',
 '3\nFiene',
 '3\nLa',
 '3\n',
 'Foot\nFirst',
 'Lieutenant-Generals,\nPrince',
 'Conti,\nDuke',
 '3\nLanguedoc',
 '2\nSurville,',
 '4\nCadrieux',
 '3\nToulouse',
 '2\nAlbergotti',
 '2\nRoyal',
 '1\nLa',
 '1\nCaraman',
 '3\nGardes',
 '2\nCharots',
 '2\nHainaut',
 '1\nMotroux',
 "1\nL'Abadie",
 '2\nGardes',
 '2\nVilleroi',
 '2\nRoussillon',
 '2\nDe',
 '2\nPiedmont',
 '3\n',
 '40\nSecond',
 'Lieute

In [346]:
splitline = {}
for i in re.findall(r'\S*[^\s\.\?]\n\S*',text):
    splitline[i] = i.replace('\n',' ')
splitline

{"Huy\nD'Auvergne,": "Huy D'Auvergne,",
 'Imprimatur,\nNovemb.': 'Imprimatur, Novemb.',
 'Sir,\nI': 'Sir, I',
 'so\n': 'so ',
 'than\n': 'than ',
 'His\n': 'His ',
 'of\nSir,': 'of Sir,',
 'the\n': 'the ',
 'Wing\nFirst': 'Wing First',
 'Lieutenant-Generals,\nDuc': 'Lieutenant-Generals, Duc',
 'Bourbon,\nMonsieur': 'Bourbon, Monsieur',
 'Major-Generals,\nDuc': 'Major-Generals, Duc',
 "d'Elbeuf,\nDuc": "d'Elbeuf, Duc",
 '2\nNoailles': '2 Noailles',
 '2\nDuras': '2 Duras',
 '2\nLuxembourg': '2 Luxembourg',
 '2\nLorges': '2 Lorges',
 '2\nGens': '2 Gens',
 '1\nchevaux': '1 chevaux',
 '1\nMontgon': '1 Montgon',
 '3\nBourbon': '3 Bourbon',
 '2\nLa': '2 La',
 '2\nVillequier': '2 Villequier',
 '2\nRottembourg': '2 Rottembourg',
 '3\nRoquespine': '3 Roquespine',
 '3\nRohan': '3 Rohan',
 '2\nPhelipeaux': '2 Phelipeaux',
 '2\nDauphin': '2 Dauphin',
 '3\nCravates': '3 Cravates',
 '3\n': '3 ',
 '37\nSecond': '37 Second',
 '3\nLa': '3 La',
 '3\nLevis': '3 Levis',
 '3\nRassent': '3 Rassent',
 '3\nMan

In [347]:
changesdict['splitline'] = {}
text = lexiconreplaceassign('splitline',changesdict,splitline,text)
changesdict['splitline']

{"Huy\nD'Auvergne,": "Huy D'Auvergne,",
 'Imprimatur,\nNovemb.': 'Imprimatur, Novemb.',
 'Sir,\nI': 'Sir, I',
 'so\n': 'so ',
 'than\n': 'than ',
 'His\n': 'His ',
 'of\nSir,': 'of Sir,',
 'the\n': 'the ',
 'Wing\nFirst': 'Wing First',
 'Lieutenant-Generals,\nDuc': 'Lieutenant-Generals, Duc',
 'Bourbon,\nMonsieur': 'Bourbon, Monsieur',
 'Major-Generals,\nDuc': 'Major-Generals, Duc',
 "d'Elbeuf,\nDuc": "d'Elbeuf, Duc",
 '2\nNoailles': '2 Noailles',
 '2\nDuras': '2 Duras',
 '2\nLuxembourg': '2 Luxembourg',
 '2\nLorges': '2 Lorges',
 '2\nGens': '2 Gens',
 '1\nchevaux': '1 chevaux',
 '1\nMontgon': '1 Montgon',
 '3\nBourbon': '3 Bourbon',
 '2\nLa': '2 La',
 '2\nVillequier': '2 Villequier',
 '2\nRottembourg': '2 Rottembourg',
 '3\nRoquespine': '3 Roquespine',
 '3\nRohan': '3 Rohan',
 '2\nPhelipeaux': '2 Phelipeaux',
 '2\nDauphin': '2 Dauphin',
 '3\nCravates': '3 Cravates',
 '3\n': '3 ',
 '37\nSecond': '37 Second',
 'Foot\nFirst': 'Foot First',
 'Lieutenant-Generals,\nPrince': 'Lieutenant-Gen

In [348]:
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur, Novemb. 20. 1694.
Edward Cooke.

To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir, I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults. For,i

In [349]:
edits += 'n'
text_sl = text

# `Correction_Rules` again

There may be a better way to do this, but a few extra seconds isn't that big a deal.

In [350]:
changesdict['correct1'] = {}

In [351]:
text = lexiconreplaceassign('correct1',changesdict,correct,text)
changesdict['correct1']

{' staitened': ' straitened',
 'Aaarschot': 'Aarschot',
 'Cologn': 'Cologne',
 'Dendermond': 'Dendermonde',
 'dissentions': 'dissensions',
 'Grimberg': 'Grimbergen',
 'Maréchal ': 'Marshal ',
 'Maréchals ': 'Marshals ',
 'Marshal General': 'Marshal-General',
 'Mehaign': 'Mehaigne',
 'occurr ': 'occur ',
 'seise ': 'seize '}

In [352]:
edits += 'c'
text_c1 = text

## Write `no_pages` output

In case you want to preserve a version of the corrected text without the page breaks, you can do that.

In [353]:
with open(outputpath + filename + '_late_nopages.txt', 'w') as f:
    f.write(text)

# Check for any duplicate letters added by `Correction`

Issue: `xamin'd` -> `examined` without padding leads to `eexamined`

NB: Some acceptable:
1. `i`: Roman numerals; Latin plural (e.g. `Imperii`)
2. Any letters in the middle of words

In [354]:
re.findall(r'\b\S*([a-z])\1{2,}\S*',text)

['e',
 'l',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e',
 'e']

In [355]:
re.findall(r'\be{2}\S*',text)

[]

# `changedict` stats

Now we come to the interesting part. here we can how many corrections were made for each type of error. Remember that since this code has been run on multiple documents with different types of errors, each document will likely only have a subset of errors.

Calculate how many changes in each revision subdict

In [356]:
changesdictstats = {}
for k,v in changesdict.items():
    changesdictstats[k] = len(v)
changesdictstats

{'apost': 0,
 'quote': 0,
 'divisor': 358,
 'longf': 0,
 'labapp': 7,
 'degree': 0,
 'ampersand': 2,
 'asterisk': 3,
 'quest': 0,
 'dollar': 0,
 'square': 13,
 'circle': 10,
 'blankpage': 1,
 'para': 0,
 'unicode': 17,
 'puncspace': 0,
 'duplapost': 0,
 'vv': 3,
 'weirdo': 0,
 'hyphen': 27,
 'allcap': 46,
 'elision': 8,
 'apostspace': 0,
 'apostspaceL': 43,
 'apostspaceR': 4,
 'syncope': 99,
 'split': 0,
 'hyphens': 0,
 'correct': 299,
 'longs': 1,
 'notindict': 0,
 'doubly': 4,
 'nonnorms': 123,
 'splitline': 112,
 'correct1': 12}

# Save outputs

Now we're practically done! At least with these first steps. In case you want to look through your cleaner text one last time:

In [357]:
print(text)

The history of the campaign in the Spanish Netherlands, Anno Dom. 1694 with the journal of the siege of Huy D'Auvergne, Edward, 1660-1737.
By Edward D'Auvergne, M. A. Rector of St. Brelade, in the Isle of Jersey, and Chaplain to Their Majesty's Regiment of Scots Guards.
London, Printed for Matt. Wotton, at the Three Daggers; and John Newton, at the Three Pigeons, near Temple-Barr, in Fleet-street, 1694.
Imprimatur, Novemb. 20. 1694.
Edward Cooke.

To the Honourable Major-General Ramsay, Colonel of Their Majesty's Regiment of Scots Guards, etc.
Sir, I Need not make an Apology for Presenting the Account of the Last campaign to You; for since Custom will have every Trifle that is published, attended with an Epistle Dedicatory, I should be very Ungrateful, if I did not embrace this occasion to acknowledge to the World the many Obligations I have to You: Though, to acquit myself of it, I must put your Honourable Name to a Piece in which I am sensible You must find a great many Faults. For,i

Now we can start saving the output.

## Save cleaned `text` to file

In [358]:
with open(outputpath + filename + '_clean_nb1_' + edits + '_nb1.txt', 'w', encoding = 'UTF-8') as export:
    export.write(text) 

## Save `changedict` as audit trail

In [359]:
with open(outputpath + filename + '_changesdict_' + edits + '_nb1.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in changesdict.items():
            writer.writerow([k, v])

To come: notebook 2, for more transformative changes. Things like standardizing proper nouns, eliminating weird archaic capitalization, and so much more.

In [360]:
blah

NameError: name 'blah' is not defined

Below is some additional code if you want, for example, to standardize all those unsightly `-eth` verb endings to the appropriate 3rd person singular form.

# Find remaining `-eth` words in cleaned text

Retokenize last version `text`

In [None]:
words1 = word_tokenize(text)
words1

Find any tokens ending with `-eth`

In [None]:
ethtoken = []
for token in words1:
    if token.endswith('eth'):
        ethtoken.append(token)
ethtoken

In [None]:
ethtoke = sorted(set(ethtoken))
ethtoke

In [None]:
len(ethtoke)

# Find all `-eth` words to add to `Corrections`

To add these `-eth` words to a lexicon:

In [None]:
with open(lexicapath + "albemarle1671_addtoproper.csv", 'r',encoding='UTF-8') as f:
    eth = f.read().split('\n')
eth

In [None]:
ethtos = []
for i in eth:
    if i.endswith('eth'):
        ethtos.append(i)
ethtos

In [None]:
len(ethtos)

Turn into a dict to save as csv

In [None]:
ethdict = {}
for i in ethtoke:
    ethdict[i] = i.replace('eth','es')
ethdict

Add titlecase versions to `ethdict`

In [None]:
with open (textpath + 'ethtofix.csv','r') as subs:
    reader = csv.reader(subs)
    ethtofix = {rows[0]:rows[1] for rows in reader} # 1st row as key, 2nd row as value
ethtofix

In [None]:
ethdict1 = {}
for k,v in ethtofix.items():
    ethdict1[k] = v
    ethdict1[k.title()] = v.title()
ethdict1

In [None]:
len(ethdict1)

Save to csv and then clean further - which more common, `-es` or `-s`?

In [None]:
with open(outputpath + 'eth_fixed1.csv', 'w', encoding = 'UTF-8') as f:
    writer = csv.writer(f)
    for k,v in ethdict.items():
            writer.writerow([k, v])

In text editor, add extra comma before line break so will match csv of `Correction_Rules`