# RegEx and Frankenstein
The online RegEx tester https://regex101.com/ is a helpful site for learning how to use regular expressions (RegEx).

W3Schools also has a very useful page about RegEx in Python: https://www.w3schools.com/python/python_regex.asp

RegEx is widely used because it is a powerful tool for text processing: it can be used to perform advanced searches. RegEx is used in search engines and in search-and-replace functions. Working with RegEx takes some getting used to, but once you get an insight into the range of tasks it can solve, you realize that it is an incredibly good tool.

This notebook doesn't try to teach you everything about RegEx. It aims to get you started, and only a few of the possibilities are illustrated below.

In addition to RegEx, this notebook contains many loops and list comprehensions, so you can also get an insight into how to write those.

## Read files
We get the book Frankenstein from Project Gutenberg and use it for the rest of the workshop. We download the Plain Text UTF-8 version.

In [3]:
import urllib.request

url = 'https://gutenberg.org/cache/epub/84/pg84.txt'
raw_text = urllib.request.urlopen(url).read().decode('utf-8')

# Keep only the book itself: the text between the START and END marker lines
start_marker = '*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***'
end_marker = '*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***'
text_start = raw_text.find(start_marker) + len(start_marker)
text_end = raw_text.find(end_marker)
text = raw_text[text_start:text_end].strip()

In [4]:
text[0:100]

'Frankenstein;\r\n\r\nor, the Modern Prometheus\r\n\r\nby Mary Wollstonecraft (Godwin) Shelley\r\n\r\n\r\n CONTENTS'
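Since this notebook is about RegEx, the marker lines can also be located with a pattern instead of a fixed string. The following is only a sketch under assumptions: sample_raw is a made-up miniature stand-in for the downloaded file so that the example runs offline, and strip_gutenberg_markers is a hypothetical helper name, not part of any library.

```python
import re

# Made-up miniature stand-in for the downloaded file (an assumption for illustration)
sample_raw = (
    "Header text\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\r\n"
    "Frankenstein;\r\n\r\nor, the Modern Prometheus\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\r\n"
    "Licence text"
)

def strip_gutenberg_markers(raw):
    """Return the text between the START and END marker lines."""
    # \* is an escaped asterisk; .*? matches as few characters as possible
    start = re.search(r'\*\*\* START OF .*? \*\*\*', raw)
    end = re.search(r'\*\*\* END OF .*? \*\*\*', raw)
    if start is None or end is None:
        # str.find() would silently return -1 here, so fail loudly instead
        raise ValueError('Project Gutenberg markers not found')
    return raw[start.end():end.start()].strip()

body = strip_gutenberg_markers(sample_raw)
print(repr(body))
```

One reason to consider this variant is that it does not depend on the exact title in the marker line, and it raises an error if a marker is missing instead of slicing at the wrong position.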

## Metacharacters

\b \S and \w and the + sign

These metacharacters are relevant when cleaning text data.

Cleaning the text is often a necessary first step before applying different methods to analyze it. Cleaning often consists of removing punctuation, changing all uppercase letters to lowercase, and filtering out [stop words](https://en.wikipedia.org/wiki/Stop_word).
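The three cleaning steps can be sketched in a few lines. This is a minimal illustration, not the notebook's own method: basic_clean is a hypothetical helper, STOP_WORDS is a tiny hand-picked list chosen for this one sentence (real stop word lists are much longer), and the pattern '\w+' is explained further down in this section.

```python
import re

# A tiny illustrative stop word list (an assumption; real lists are much longer)
STOP_WORDS = {'i', 'am', 'by', 'a', 'and', 'my', 'is', 'one', 'of', 'the', 'that'}

def basic_clean(text):
    """Drop punctuation, lowercase everything, and filter out stop words."""
    words = re.findall(r'\w+', text)           # keep only runs of word characters
    words = [word.lower() for word in words]   # change uppercase to lowercase
    return [word for word in words if word not in STOP_WORDS]

sentence = ('I am by birth a Genevese, and my family is one of '
            'the most distinguished of that republic.')
cleaned_words = basic_clean(sentence)
print(cleaned_words)
# → ['birth', 'genevese', 'family', 'most', 'distinguished', 'republic']
```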

When we need to remove punctuation from our text, we use RegEx patterns and Python string methods. Below we test two different RegEx patterns.

When we work with RegEx, the website [regex101.com](https://regex101.com/) is a brilliant tool, because it helps us both to understand RegEx and to write a RegEx pattern.

Try to open the page and insert this text string: _Chapter 1 I am by birth a Genevese, and my family is one of the most distinguished of that republic._ in the field 'TEXT STRING'.

**First RegEX pattern '\b\S+\b'**.

In the 'REGULAR EXPRESSION' field you can write this pattern _'\b\S+\b'_.

- \b finds the position at the boundary of a word (word boundary).
- \S matches any character that is not whitespace.
- + matches the preceding token one or more times, as many times as possible. This is why the plus is said to be greedy.
- \b again finds a word boundary, this time at the end of the match.

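When you put \b\S+\b together, you match a run of non-whitespace characters that starts and ends at a word boundary. Letters, digits and underscores are kept, and so are symbols inside a word such as hyphens and apostrophes, but punctuation before or after a word, such as periods, commas and question marks, is left out.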
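If you prefer to check this in Python rather than on regex101.com, the same test can be run with re.findall on the sentence suggested above. A minimal sketch:

```python
import re

sample = ('Chapter 1 I am by birth a Genevese, and my family is one of '
          'the most distinguished of that republic.')

# Runs of non-whitespace characters that start and end at a word boundary
tokens = re.findall(r'\b\S+\b', sample)
print(tokens[:8])
# → ['Chapter', '1', 'I', 'am', 'by', 'birth', 'a', 'Genevese']
```

Note that the comma after Genevese and the final period are not part of any match, because there is no word boundary between a punctuation mark and the space (or end of text) that follows it.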

**Second RegEx pattern '\w+'**.
Another RegEx pattern that is also used for cleaning is '\w+'.

- \w matches any letter (uppercase or lowercase), any digit, or an underscore (_).
- + matches the preceding token one or more times.

When you put \w+ together, you match whole words composed of letters, digits and underscores.

**Comparison of the two RegEX patterns**

In the first method, _There-for_ remains one word.

In the second method it becomes two words, _There for_.

Search, for example, for _About two o'clock_.

In the first method, _o'clock_ remains a word.

In the second method it becomes two words, _o clock_.

Both methods leave us with underscores (_), so to get rid of the underscores we use the string method .replace().
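The comparison can be reproduced on a single sentence. The sample sentence below is made up for illustration (it is not a quotation from the book), and it includes an underscore-wrapped word to show why .replace() is needed afterwards.

```python
import re

# Made-up sentence containing a hyphen, an apostrophe and underscores
sample = "There-for he came about two o'clock, _very_ late."

first = re.findall(r'\b\S+\b', sample)   # first pattern
second = re.findall(r'\w+', sample)      # second pattern
print(first)
# → ['There-for', 'he', 'came', 'about', 'two', "o'clock", '_very_', 'late']
print(second)
# → ['There', 'for', 'he', 'came', 'about', 'two', 'o', 'clock', '_very_', 'late']

# The underscore counts as a word character, so both patterns keep it
without_underscores = [word.replace('_', '') for word in second]
print(without_underscores)
```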

Try both methods below and inspect the result.

In [5]:
import re

def clean_text_1(text):
    # Use the \b\S+\b regex pattern to extract words
    words = re.findall(r'\b\S+\b', text)

    # Join the extracted words into a cleaned text
    cleaned_text = ' '.join(words)

    return cleaned_text


def clean_text_2(text):
    # Use \w+ regex pattern to extract words
    words = re.findall(r'\w+', text)

    # Join the extracted words into a cleaned text
    cleaned_text = ' '.join(words)

    return cleaned_text

In [6]:
cleaned_text = clean_text_1(text)

print(cleaned_text[0:100])

Frankenstein or the Modern Prometheus by Mary Wollstonecraft Godwin Shelley CONTENTS Letter 1 Letter


In [7]:
cleaned_text = clean_text_2(text)

print(cleaned_text[0:100])

Frankenstein or the Modern Prometheus by Mary Wollstonecraft Godwin Shelley CONTENTS Letter 1 Letter


## \w+ along with \b

Finding words with particular endings, e.g. _day_, can help you gain insight into where and when the story takes place.

Why doesn't anything happen on a Friday?

You can also use endings to find grammatical forms, e.g. words with a long suffix will be relatively easy to identify.

In [8]:
ending = re.findall(r'\w+day\b', text)
print(ending)

['yesterday', 'holiday', 'Monday', 'Yesterday', 'Sunday', 'Thursday', 'today', 'today', 'yesterday', 'yesterday', 'everyday']
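The same idea works for other endings. The sketch below uses a made-up sentence (an assumption for illustration, not a quotation from the book) and the suffix ness as an example of a grammatical ending. It also shows that '\w+day\b' does not match the bare word day, because \w+ requires at least one word character in front of it.

```python
import re

# Made-up sentence for illustration
sample = ('On Monday the day was bright, but by Friday a darkness '
          'and a sadness had returned.')

day_words = re.findall(r'\w+day\b', sample)    # words ending in "day"
ness_words = re.findall(r'\w+ness\b', sample)  # words ending in "ness"
print(day_words)
# → ['Monday', 'Friday']
print(ness_words)
# → ['darkness', 'sadness']
```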


## More metacharacters, as well as pipes, lists and question marks

In literature, comparisons are often used to make a point clearer by painting a picture of what is being described. Comparisons also help make the text more lively and interesting.

RegEx makes it a manageable task to retrieve examples of comparisons in Frankenstein, because we can find text strings that follow the pattern of a typical comparison.

We can illustrate it in the following way. We look for phrases whose pattern is either "as a ..." or "as an ...".

The RegEx pattern can be written like this:

'as\sa\s\w+'

The word 'as' is followed by \s, meaning whitespace, followed by a, then by another \s, followed by \w, meaning word character, followed by +, meaning "one or more of the previous".


If you also want to search for "as an ..." there are two ways to do it.


First way is to use pipe |. Pipe means "or". The regex pattern will then look like this: 'as\sa\s\w+|as\san\s\w+'

Another way is to use square brackets [ ], which define a list of characters.

It looks like this: 'as\sa[n]?\s\w+'. The list holds the letters that can stand in that position in the word; here it holds only n. The question mark indicates that the letter may or may not be there.
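The two ways of writing the pattern can be compared on a short sample. The sentence is made up for illustration. The sketch also adds two variants that are not used in this notebook: since the list contains a single letter, the brackets can be dropped ('an?'), and a leading \b keeps the pattern from matching the last two letters of a word such as was.

```python
import re

# Made-up sentence for illustration
sample = 'It was a dream. He came as a friend and spoke as an equal.'

with_pipe = re.findall(r'as\sa\s\w+|as\san\s\w+', sample)
with_list = re.findall(r'as\sa[n]?\s\w+', sample)
with_optional = re.findall(r'as\san?\s\w+', sample)    # same meaning, no brackets
print(with_pipe)
# → ['as a dream', 'as a friend', 'as an equal']

# 'as a dream' comes from "was a dream"; \b requires 'as' to start a word
anchored = re.findall(r'\bas\san?\s\w+', sample)
print(anchored)
# → ['as a friend', 'as an equal']
```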

In [21]:
comparison = re.findall(r'as\sa\s\w+', cleaned_text)
print (comparison)

['as a steady', 'as a child', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as a necessity', 'as a German', 'as a boy', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as a little', 'as a dream', 'as a certain', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a vagabond', 'as a Turkish', 'as a Christian', 'as a boarder', 'as a distant', 'as a listener', 'as a true', 'as a luxury', 'as a fool', 'as a recompense'

In [22]:
comparison = re.findall(r'as\sa\s\w+|as\san\s\w+', cleaned_text)
print (comparison)

['as a steady', 'as a child', 'as an under', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as an infant', 'as a necessity', 'as a German', 'as a boy', 'as an inferior', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as an uncouth', 'as a little', 'as a dream', 'as a certain', 'as an easier', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as an odious', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as an irresistible', 'as an historical', 'as an air', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a

In [23]:
comparison = re.findall(r'as\sa[n]?\s\w+', cleaned_text)
print (comparison)

['as a steady', 'as a child', 'as an under', 'as a most', 'as a Turk', 'as a remarkably', 'as a human', 'as a little', 'as a brother', 'as a double', 'as a halo', 'as a merchant', 'as a considerable', 'as a sense', 'as a show', 'as a fair', 'as a restorative', 'as an infant', 'as a necessity', 'as a German', 'as a boy', 'as an inferior', 'as a promise', 'as a deformed', 'as a strong', 'as a narrow', 'as an uncouth', 'as a little', 'as a dream', 'as a certain', 'as an easier', 'as a bold', 'as a mystery', 'as a most', 'as a proof', 'as a tendency', 'as a divine', 'as an odious', 'as a widow', 'as a servant', 'as a great', 'as a judgement', 'as a Roman', 'as an irresistible', 'as an historical', 'as an air', 'as a miniature', 'as a new', 'as a strange', 'as a girl', 'as a proof', 'as a dire', 'as a murderer', 'as a wretch', 'as a creature', 'as a murderess', 'as a wreck', 'as a lullaby', 'as a poor', 'as a new', 'as a little', 'as a small', 'as a lovely', 'as a lady', 'as a guide', 'as a

## Curly brackets

Curly brackets are useful, for example, when making a concordance (a word shown with its context).

We want to find excerpts of the text that contain storm, because we are interested in zooming in on the text and seeing exactly how the term is used.

For this we use the full stop ( . ), which matches any character except a line break, and {30}, which repeats it exactly 30 times. Together, .{30} gives us the 30 characters before the letters storm.

The .{30} after storm gives us another 30 characters.

Try to see if you can use some of what has been reviewed above to retrieve text extracts that contain the word Roman.

In [24]:
re.findall(r'.{30}storm.{30}', cleaned_text)

['st be his story frightful the storm which embraced the gallant ve',
 't violent and terrible thunderstorm It advanced from behind the m',
 ' heavens I remained while the storm lasted watching its progress ',
 ' of preservation to avert the storm that was even then hanging in',
 'he most beautiful figures The storm appeared to approach rapidly ',
 ' on although the darkness and storm increased every minute and th',
 ' from the preceding flash The storm as is often the case in Switz',
 ' the heavens The most violent storm hung exactly north of the tow',
 ' the village of Copêt Another storm enlightened Jura with faint f',
 'y retreats What were rain and storm to me My mule was brought to ',
 'ning to rise Suddenly a heavy storm of rain descended I had been ']
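Because {30} means exactly 30, an occurrence with fewer than 30 characters before or after it is not found. Curly brackets also accept a range, so {0,30} means "between 0 and 30". The sketch below wraps this in a hypothetical helper called concordance (not part of the re module) and uses a made-up sample with a width of 10 to keep the output short.

```python
import re

def concordance(text, keyword, width=30):
    """Return each occurrence of keyword with up to `width` characters on each side."""
    # re.escape protects keywords that contain characters with a special meaning
    pattern = r'.{0,%d}%s.{0,%d}' % (width, re.escape(keyword), width)
    return re.findall(pattern, text)

# Made-up sample: one occurrence near the start, one at the very end
sample = 'The storm appeared to approach rapidly and I watched the storm'

results = concordance(sample, 'storm', width=10)
for line in results:
    print(repr(line))

# With an exact count, neither occurrence has enough context on both sides
exact = re.findall(r'.{10}storm.{10}', sample)
print(exact)
# → []
```

Since the full stop does not match a line break, this kind of search works best on text where line breaks have been replaced by spaces, such as cleaned_text above.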

## Square brackets [A-Z]

Find words that start with a capital letter. Inside square brackets, A-Z is a range that matches any single uppercase letter from A to Z.

In [25]:
upper_case_word = re.findall(r'[A-Z]\w+', text)
print (upper_case_word[0:100])

['Frankenstein', 'Modern', 'Prometheus', 'Mary', 'Wollstonecraft', 'Godwin', 'Shelley', 'CONTENTS', 'Letter', 'Letter', 'Letter', 'Letter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Letter', 'To', 'Mrs', 'Saville', 'England', 'St', 'Petersburgh', 'Dec', 'You', 'London', 'Petersburgh', 'Do', 'This', 'Inspirited', 'There', 'Margaret', 'There', 'Its', 'What', 'These', 'But', 'These', 'This', 'North', 'Pacific', 'Ocean', 'You', 'Uncle', 'Thomas', 'My', 'These', 'These', 'Homer', 'Shakespeare', 'You', 'But', 'Six', 'North', 'Sea', 'Twice', 'Greenland', 'And', 'Margaret', 'My', 'Oh', 'My', 'This', 'Russia', 'They', 'English', 'The', 'St', 'Petersburgh', 'Archangel', 'June', 'Ah', 'If', 'If', 'Farewell', 'Margaret', 'Heaven', 'Your', 'Walton', 'Letter']
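Note that '[A-Z]\w+' requires at least one word character after the capital letter, so one-letter words such as I and A are not in the list. If you want to keep them, one possible variant (an assumption about what you might want, not something the notebook uses) is to replace + with *, meaning "zero or more", and add \b in front. The sample sentence is made up.

```python
import re

# Made-up sentence for illustration
sample = 'I saw Elizabeth in Geneva. A storm came.'

two_or_more = re.findall(r'[A-Z]\w+', sample)    # capital letter plus one or more characters
one_or_more = re.findall(r'\b[A-Z]\w*', sample)  # capital letter plus zero or more characters
print(two_or_more)
# → ['Elizabeth', 'Geneva']
print(one_or_more)
# → ['I', 'Elizabeth', 'Geneva', 'A']
```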


Many of these words are capitalized because they come right after a full stop, so they are not what I would call "true" capitalized words.

If you want to filter the "false" ones out of your list, you can expose them with a loop and a condition that checks whether the words are written in lowercase elsewhere in the text. If they are, they are "false".

In [26]:
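true_upper_case = []
for word in upper_case_word:
    # Note: `in` on a string is a substring test, so this asks whether the
    # lowercase form occurs anywhere in the text, even inside a longer word
    if word.lower() not in text:
        true_upper_case.append(word)
print(true_upper_case[0:100])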

['Frankenstein', 'Prometheus', 'Mary', 'Wollstonecraft', 'Godwin', 'Shelley', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Chapter', 'Mrs', 'Saville', 'England', 'Petersburgh', 'London', 'Petersburgh', 'Margaret', 'Pacific', 'Thomas', 'Homer', 'Shakespeare', 'Greenland', 'Margaret', 'Russia', 'English', 'Petersburgh', 'June', 'Ah', 'Margaret', 'Walton', 'Mrs', 'Saville', 'England', 'Margaret', 'Thomas', 'Englishman', 'Russian', 'Turk', 'Africa', 'America', 'Robert', 'Walton', 'Mrs', 'Saville', 'England', 'July', 'England', 'England', 'Adieu', 'Margaret', 'Mrs', 'Saville', 'England', 'August', 'Monday', 'July', 'European', 'English', 'Margaret', 'Margaret', 'August', 'August', 'Walton', 'Chapter', 'Genevese', 'Beaufort', 'Lucerne', 'Beaufort', 'Beaufort', 'Reuss', 'Beaufort', 'Caro
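The loop above tests whether the lowercase form occurs anywhere in the text, including inside longer words. A possible refinement, sketched here under assumptions, is to compare against the set of whole lowercase words instead. The name true_capitalised is hypothetical and the sample text is made up so that the difference is visible: mary is hidden inside customary.

```python
import re

def true_capitalised(text):
    """Keep capitalized words whose lowercase form never occurs as a whole word."""
    capitalised = re.findall(r'[A-Z]\w+', text)
    # All words that start with a lowercase letter, collected in a set for fast lookup
    lowercase_words = set(re.findall(r'\b[a-z]\w*', text))
    return [word for word in capitalised if word.lower() not in lowercase_words]

# Made-up sample for illustration
sample = 'The storm hit Geneva. Mary feared the storm, a customary dread.'

# Substring test, as in the loop above: drops Mary because of "customary"
substring_result = [word for word in re.findall(r'[A-Z]\w+', sample)
                    if word.lower() not in sample]
whole_word_result = true_capitalised(sample)
print(substring_result)
# → ['Geneva']
print(whole_word_result)
# → ['Geneva', 'Mary']
```

Neither version can recognize a word that only ever appears at the start of a sentence, so the result still needs a manual look.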