## Workshop: Introduction to Regular Expressions with Python for Digital Humanities (Part I)

Leo (Lizhou) Fan

Acknowledgement: Dr. Ashley Sanders Garcia

### Import the Example File
A pseudo OCR document

In [3]:
# Read the whole file into one string; the with-block closes the file automatically
with open('example.txt', 'r') as f:
    file = f.read()

In [4]:
file

'Aaaaa,      bbb.\nLF:ccc  ddd,\neeeeee. f\n-1-\n\nH. AG:I j \nK. lllll mn\n        -2\n\nOPQ345634\nR         AG:ST\nUV4W\n---3-\n\n\n'

### 0. Regex Implementation in Python

In [5]:
import re

To find every match of a pattern in a string, you can use the `re.findall()` function:

In [6]:
# example: finding all the digits in the example file
re.findall(r'\d',file)

['1', '2', '3', '4', '5', '6', '3', '4', '4', '3']

There are more functions available for [performing matches](https://docs.python.org/3/howto/regex.html#performing-matches).
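As a minimal sketch of a few of those other functions, the cell below contrasts `re.search()`, `re.match()`, and `re.finditer()`. The string `sample` is a made-up stand-in (an assumption, not the workshop's `example.txt`), so the cell runs on its own.

```python
import re

# A short stand-in string (hypothetical; not the workshop's example.txt)
sample = 'OPQ345634\nUV4W'

# re.search() returns the first match anywhere in the string (or None)
first = re.search(r'\d+', sample)
first_digits = first.group()   # the matched text
first_span = first.span()      # (start, end) positions of the match

# re.match() only tries to match at the very beginning of the string
at_start = re.match(r'\d+', sample)   # None here: the string starts with letters

# re.finditer() yields one match object per match, so positions are available
positions = [(m.group(), m.start()) for m in re.finditer(r'\d+', sample)]

print(first_digits, first_span, at_start, positions)
```

Unlike `re.findall()`, which returns plain strings, these functions return match objects, which is useful when you also need to know where a match occurred.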

### 1. Removing Page Numbers
The page numbers here follow a pattern: each one is immediately **preceded** by a `-` character. We can remove them by replacing every `-` followed by a digit with an empty string, using the `re.sub()` function:

In [8]:
file = re.sub(r'-\d','',file)

In [9]:
file

'Aaaaa,      bbb.\nLF:ccc  ddd,\neeeeee. f\n-\n\nH. AG:I j \nK. lllll mn\n        \n\nOPQ345634\nR         AG:ST\nUV4W\n---\n\n\n'

The page numbers disappear, while all other digits are kept.
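One case where the pattern above would need adjusting is a page number with more than one digit, because `\d` matches exactly one digit. The sketch below uses a made-up string (`page` is hypothetical, not part of the example file) and adds `+`, which means "one or more", so that the whole number is removed.

```python
import re

# Hypothetical text with a two-digit page number and an unrelated number
page = 'text\n-12-\nmore text 345'

# '-\d' removes only the '-' and the first digit, leaving '2-' behind
one_digit = re.sub(r'-\d', '', page)

# '-\d+' removes the '-' and every digit that directly follows it
all_digits = re.sub(r'-\d+', '', page)

print(repr(one_digit))
print(repr(all_digits))
```

In both cases the `345` is untouched, since it is not preceded by a `-`.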

##### Warning: Regex is a very case-specific technique. Some of the methods introduced below may need modification in other situations.

### 2. Joining Lines

Next, clean up the `\n` and `-` characters. Again, we use the `re.sub()` function. Notice that this time we use a space as the replacement, so that words stay separated.

In [17]:
file = re.sub(r'(\n)|-',' ',file)

In [18]:
file

'Aaaaa,      bbb. LF:ccc  ddd, eeeeee. f    H. AG:I j  K. lllll mn           OPQ345634 R         AG:ST UV4W       '

##### Warning: If lines are stored as a list, some more advanced functions are needed. Expect these in our next session.

The problem now is that there are too many spaces. We then clean them up by collapsing every run of spaces into a single space. Here, `+` is a quantifier that matches one or more occurrences of whatever precedes it. Note that a single space still remains at the end of the string.

In [21]:
file = re.sub(r' +',' ',file)

In [22]:
file

'Aaaaa, bbb. LF:ccc ddd, eeeeee. f H. AG:I j K. lllll mn OPQ345634 R AG:ST UV4W '

You can learn more about [repeating things](https://docs.python.org/3/howto/regex.html#repeating-things).
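As a rough sketch of how the common quantifiers differ, the cell below tries several of them on made-up strings (`text` and `s` are assumptions for illustration). It also shows one possible way to drop leftover spaces at the ends of a string with `str.strip()`.

```python
import re

# Hypothetical string with runs of spaces, including trailing ones
text = 'Aaaaa,      bbb. f    H.   '

# ' +' collapses each run of spaces; strip() then removes spaces at both ends
collapsed = re.sub(r' +', ' ', text).strip()

# The same letters with different quantifiers
s = 'a ab abb abbb'
star = re.findall(r'ab*', s)      # 'a' followed by zero or more 'b'
plus = re.findall(r'ab+', s)      # 'a' followed by one or more 'b'
optional = re.findall(r'ab?', s)  # 'a' followed by zero or one 'b'
exact = re.findall(r'ab{2}', s)   # 'a' followed by exactly two 'b'

print(repr(collapsed))
print(star, plus, optional, exact)
```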

### 3. Splitting Documents with Conditions

In this pseudo OCR document, we assume that every `:` preceded by two capital initials is correct and marks the name of a speaker. If there are no initials before a sentence, it might be the introduction of the document. Sometimes punctuation is missing (to mimic the complexity of OCR documents).

Our task here is to split the document string into a list of sentences.

In [26]:
re.split(r'[A-Z]{2}:',file)

['Aaaaa, bbb. ',
 'ccc ddd, eeeeee. f H. ',
 'I j K. lllll mn OPQ345634 R ',
 'ST UV4W ']

One drawback of doing so is that we lose the initials. We can use more advanced regex techniques involving grouping to keep them. We can also use [Pandas dataframes](https://www.geeksforgeeks.org/python-pandas-dataframe/) to store this kind of information.
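One possible way to keep the initials, sketched below, is to wrap them in a capturing group: when the pattern passed to `re.split()` contains a group, the text matched by that group is kept in the result. The string `text` is a copy of the cleaned document from the previous step, repeated here so the cell runs on its own; the names `intro` and `pairs` are made up for illustration.

```python
import re

# The cleaned string from the previous step, copied so this cell is self-contained
text = 'Aaaaa, bbb. LF:ccc ddd, eeeeee. f H. AG:I j K. lllll mn OPQ345634 R AG:ST UV4W '

# The parentheses capture the two initials; the ':' stays outside the group
parts = re.split(r'([A-Z]{2}):', text)

# parts alternates: [intro, initials, speech, initials, speech, ...]
intro = parts[0]
pairs = list(zip(parts[1::2], parts[2::2]))   # (initials, speech) tuples

print(repr(intro))
for initials, speech in pairs:
    print(initials, '->', repr(speech))
```

A list of tuples like `pairs` could then be passed to `pandas.DataFrame(pairs, columns=['speaker', 'text'])`, where the column names are only a suggestion.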

**Question:** Do you spot any other problems in the above list? Is there anything that does not match your understanding of "sentences"?

##### Is the above pretty easy? Maybe, maybe not... There are more things to learn:

### 4. Correcting Common OCR Errors
Many posts do a good job of introducing this topic:
1. [Cleaning OCR’d text with Regular Expressions](https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions) by Laura Turner O'Hara
2. [Generating an Ordered Data Set from an OCR Text File](https://programminghistorian.org/en/lessons/generating-an-ordered-data-set-from-an-OCR-text-file) by Jon Crump  
3. [Using regular expressions to clean and process OCR data](https://www.meredithpaker.com/updates/regexcleaning) by Meredith M. Paker
4. [TroveKleaner: a Knime workflow for correcting OCR errors](http://seenanotherway.com/trovekleaner/) by Angus Veitch

See the Tasks in the slides for what you need to explore in these four (and maybe more) posts!

Appendix: Suggested readings
1. [Regex HOWTO](https://docs.python.org/3/howto/regex.html#regular-expression-howto) by A.M. Kuchling introduces basic regular expressions in Python. Two of the links above point to sections of this document.
2. You can also download a [Regex Cheat Sheet](https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/) by David Child. This is one of the most popular cheat sheets about Regex.

##### Good Luck!