### Regular expressions

Regular expressions (regex) are patterns that match parts of a text and, optionally, replace them. They are a powerful way of finding and changing strings.

The most important point is that regular expressions are not specific to Python. They're ubiquitous in programming and usually available in any piece of software that works with sequences of characters.

Data preparation is a large part of most DH projects. Regex is a key tool, whatever your software or programming language. Learn regex once: use everywhere!

Regex has limitations. We'll come back to this at the end, but regex only works with strings. In regex, the number _5_ is a string, not an integer. That means you can find sequences of digits, but with regex alone you can't increment those numbers or do other mathematical operations on them.

Fortunately a programming language can do that. So if you combine regex with something like Python you can have the best of both.
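
As a small preview of that combination, here is a minimal sketch: regex finds the runs of digits as strings, and Python converts them to integers for the arithmetic. The sample is a short fragment of _Persuasion_ typed in by hand, and ```re.findall``` and ```[0-9]``` are both introduced properly further down.

```python
import re

# a short fragment of the novel, typed in by hand for this sketch
sample = "born June 1, 1785; Anne, born August 9"

# regex finds each run of digits, but only as strings
found = re.findall(r"[0-9]+", sample)
print(found)

# Python turns the strings into integers, so now we can do arithmetic
incremented = [int(number) + 1 for number in found]
print(incremented)
```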

Let's read in _Persuasion_ again, as we did last week. Make sure it's in the same folder as this notebook, or add a path that points to it:

In [2]:
with open('persuasion.txt', 'r', encoding="utf-8") as f:
    persuasion = f.read()

Last week we had a really awkward way of looking for the context in which 'Anne' occurs:

In [3]:
persuasion.find("Anne")

2133

In [4]:
persuasion[2113:2153]

' born\nJune 1, 1785; Anne, born August 9,'

To get the next occurrence we'd have to search the slice that starts after the first one, here from 2153 onwards (note that parts of a slice can be omitted).

In [5]:
persuasion[2153:].find("Anne")

4320

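But that result is relative to the start of the slice, not to the whole text, so using it as an index into ```persuasion``` misses _Anne_: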

In [7]:
persuasion[4300:4340]

'ealed his failings, and promoted his rea'
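
One way around this, sketched on a short made-up string rather than the novel: add the slice's starting point back on to the relative index, or pass a start position as the second argument to ```str.find```, which then reports an index into the whole string.

```python
# a made-up sentence standing in for the novel
text = "Anne met Captain Wentworth. Later Anne walked home."

first = text.find("Anne")
start = first + len("Anne")  # begin the next search just after the first hit

# searching a slice gives an index relative to the start of that slice...
relative = text[start:].find("Anne")
# ...so we have to add the starting point back on
absolute = start + relative

# str.find also accepts a start position and returns an index into the whole string
also_absolute = text.find("Anne", start)

print(relative, absolute, also_absolute)
print(text[absolute:absolute + 4])
```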

We could programmatically loop through the text and look for the string we want in each fragment. Here's an approach to splitting the text in Python, using a ```while``` loop. But this is getting complicated immediately.

In [11]:
offset = 0
while offset < 1000:
    persuasion_chunk = persuasion[offset:offset + 100]
    print(f"{offset}:{offset + 100}") # show the chunk range
    print(persuasion_chunk)
    offset += 100
    #print("finished")

0:100
﻿The Project Gutenberg eBook of Persuasion
    
This ebook is for the use of anyone anywhere in the 
100:200
United States and
most other parts of the world at no cost and with almost no restrictions
whatsoeve
200:300
r. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License inclu
300:400
ded with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you
400:500
 will have to check the laws of the country where you are located
before using this eBook.

Title: P
500:600
ersuasion

Author: Jane Austen

Release date: February 1, 1994 [eBook #105]
                Most rec
600:700
ently updated: September 10, 2023

Language: English

Credits: Sharon Partridge and Martin Ward
    
700:800
    Revised by Richard Tonsing.


*** START OF THE PROJECT GUTENBERG EBOOK PERSUASION ***




Persua
800:900
sion


by Jane Austen

(1818)




Contents


 CHAPTER I.
 CHAPTER II.
 CHAPTER III.
 CHAPTER IV.
 CH
900:1000
APTER V.
 CH

We're only looking for literal strings, so we can't ask for the context around a string like _Anne_, because we don't know in advance what that context is. This is where _regular expressions_, also called _regex_, can help.

First we need to import Python's ```re``` module. It's part of the standard library so it will be installed in any normal installation of Python.

In [12]:
import re

Now we can use regex, in which some characters are _literal_ and some are _special_. Nearly every regex has a combination of both.

But let's start just by running up a regex that only looks for the literal string _Anne_. Note that, because we imported the whole ```re``` library, we have to refer to ```re.findall```, not just ```findall```. 

We'll read the results into a variable called ```anne_context```.

In [13]:
anne_context = re.findall(r"Anne", persuasion)

The results are now held in ```anne_context```.

In [14]:
anne_context

['Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 

A key special character in regex is ```.```, which means _any character_, so we can add it either side of our literal _Anne_ string to get the context, e.g. 10 characters either side.

In [18]:
anne_context = re.findall(r"..........Anne..........", persuasion)
anne_context

[' 1, 1785; Anne, born Aug',
 'rove; but Anne, with an ',
 's only in Anne that she ',
 's before, Anne Elliot ha',
 ' growing. Anne haggard, ',
 'father of Anne and her s',
 'dation of Anne’s had bee',
 ' on which Anne wanted he',
 'ntry. All Anne’s wishes ',
 'l fate of Anne attended ',
 ' her dear Anne’s known w',
 'bourhood. Anne herself w',
 'regard to Anne’s dislike',
 'What Miss Anne says, is ',
 'lace; and Anne, after th',
 'hed, than Anne, who had ',
 'think of! Anne Elliot, s',
 'ths ended Anne’s share o',
 're, while Anne was ninet',
 'his case, Anne had left ',
 'sness for Anne’s being t',
 ' point of Anne’s conduct',
 'd to; but Anne, at seven',
 'ent could Anne Elliot ha',
 'nced that Anne would not',
 'shed, and Anne though dr',
 ' claiming Anne when anyt',
 'o without Anne,” was Mar',
 'I am sure Anne had bette',
 ' all; and Anne, glad to ',
 'tled that Anne should no',
 'tained to Anne, in Mrs C',
 'se, while Anne could be ',
 '” replied Anne, “which a',
 'lves, and An

What are we getting back from Python here? Is it a string? How can we check?
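
One way to check, sketched on a made-up sample: the built-in ```type``` function reports what kind of object a variable holds, and ```isinstance``` tests for one particular kind.

```python
import re

# a made-up sample standing in for the novel
sample = "Anne Elliot and Anne's sister"
result = re.findall(r"Anne", sample)

print(type(result))              # the kind of object findall gave back
print(isinstance(result, list))  # True if it is a list
print(type(result[0]))           # the kind of object held inside it
```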

Because this turns out to be a list, we can use slicing again. For example to get the last mentions of Anne in _Persuasion_ we can do this:

In [24]:
anne_context[-5:]

['y keeping Anne with her ',
 'in seeing Anne restored ',
 'is cousin Anne’s engagem',
 'ffices by Anne had been ',
 'er heart. Anne was tende']

If you're ever unsure about the syntax for lists, create a small list of your own to check your intuition.

In [27]:
mylist = [1, 2, 3, 4, 5, 6]
mylist[-5:]

[2, 3, 4, 5, 6]

With the regex, we can always add or remove full stops to get more or less context.

But there is a problem with the results of the regex. Since this is a list, we can get its length:

In [25]:
len(anne_context)

291

In [26]:
persuasion.count("Anne")

496

These kinds of sense checks are good to build in to your thinking, and your code, as much as possible.

The next special character we'll use is ```?```, meaning _one or none_ of the preceding character.

If we're not sure of the spelling of _Anne_ we can now allow for _Ann_ as well:


In [28]:
anne_context = re.findall(r"..........Anne?..........", persuasion)
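
To see what ```?``` does without the surrounding context, here is a minimal sketch on a made-up string. The final _e_ is matched when it is there and skipped when it isn't.

```python
import re

# made-up sample with three spellings
sample = "Ann, Anne and Anna"

# e? means: an 'e' if there is one, otherwise nothing
matches = re.findall(r"Anne?", sample)
print(matches)
```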

We can also use ```[^]``` to ask for any character other than the ones listed after the ```^``` symbol.

In [29]:
no_anne_context = re.findall(r".......Ann[^e]+......", persuasion)

In [30]:
no_anne_context

[]
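
To see a negated character class actually match something, here is a sketch on a made-up string that does contain _Ann_ followed by letters other than _e_.

```python
import re

# made-up sample: only the last two names have a non-'e' after 'Ann'
sample = "Anne, Anna and Annie"

# [^e] means: exactly one character, as long as it is not 'e'
not_anne = re.findall(r"Ann[^e]", sample)
print(not_anne)
```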

We can also use ```?``` to make the characters around our string optional. This is pretty crude but let's do it anyway:

In [31]:
anne_context = re.findall(r".?.?.?.?.?.?.?.?Anne.?.?.?.?.?.?.?.?.?.?", persuasion)

In [32]:
len(anne_context)

494

A much better way is to give a range of how many characters we want to match, using ```{``` and ```}```.

In [33]:
anne_context = re.findall(r".{0,20}Anne.{0,20}", persuasion)

In [34]:
len(anne_context)

493

We're still missing three...
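
The difference between a fixed amount of context and a range can be sketched on two made-up strings. With ```{5}``` there must be exactly five characters on each side, while ```{0,5}``` accepts anything from none to five.

```python
import re

# made-up samples: in the second one, Anne is the very first word
middle = "Dear Anne Elliot"
start = "Anne smiled"

# up to five characters of context on each side
print(re.findall(r".{0,5}Anne.{0,5}", middle))

# exactly five characters are required before Anne, so this finds nothing
fixed = re.findall(r".{5}Anne.{5}", start)
print(fixed)

# none to five characters are allowed, so the match succeeds
flexible = re.findall(r".{0,5}Anne.{0,5}", start)
print(flexible)
```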

Last week we weren't sure if there were characters called _Annette_ or places called _Annecy_ in the text. With regex we can check that. Let's look for _A_ followed by one or more lower-case letters.

Square brackets represent a _character class_, meaning _any one of these in any order_. ```[a-z]``` is a convenience to save you from typing ```[abcdefghijklmnopqrstuvwxyz]``` every time.

```+``` is like the ```?``` we saw above, but it means _one or more_.
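
Here is the same pattern tried on a made-up sentence first, as a sketch. A capital _A_ on its own is not matched, because ```+``` insists on at least one lower-case letter after it.

```python
import re

# made-up sample, including a lower-case 'and' and a lone capital 'A'
sample = "Anne and Admiral Croft met At Bath. A walk followed."

# a capital A, then one or more lower-case letters
capitals = re.findall(r"A[a-z]+", sample)
print(capitals)
```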

In [36]:
capital_a = re.findall(r"A[a-z]+", persuasion)
capital_a

['Author',
 'Austen',
 'Austen',
 'Anne',
 'August',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Always',
 'As',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'And',
 'Anne',
 'Anne',
 'Anne',
 'As',
 'After',
 'Anne',
 'Anne',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'And',
 'Admiral',
 'Anne',
 'Admiral',
 'Admiral',
 'And',
 'Admiral',
 'At',
 'After',
 'Anne',
 'As',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'An',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'All',
 'Anne',
 'Admiral',
 'Admiral',
 'Anne',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'And',
 'Anne',
 'Anne',
 'Anne',
 'Admiral',
 'Anne',
 'Accordingly',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Anne',
 'Ann

In [37]:
capital_a.sort()
capital_a

['About',
 'About',
 'Abydos',
 'Accordingly',
 'Additional',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiral',
 'Admiralty',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'After',
 'Af

In [38]:
set(capital_a)

{'About',
 'Abydos',
 'Accordingly',
 'Additional',
 'Admiral',
 'Admiralty',
 'After',
 'Again',
 'Ah',
 'Alarming',
 'Alas',
 'Alicia',
 'All',
 'Allowances',
 'Altered',
 'Always',
 'An',
 'And',
 'Anne',
 'Another',
 'Anxious',
 'Any',
 'Anybody',
 'Anything',
 'Archibald',
 'Archive',
 'Are',
 'As',
 'Asp',
 'At',
 'Atkinson',
 'Atlantic',
 'August',
 'Austen',
 'Author',
 'Ay',
 'Aye'}

So there are, apparently, characters called _Alicia_, _Archibald_ and _Atkinson_.

In [39]:
len(set(capital_a))

37

Can we use regex to look at all the verbs associated with Anne in _Persuasion_? Here's a first attempt:

In [40]:
annes_verbs = re.findall(r"Anne [^ ]+ed\W", persuasion)

In [42]:
annes_verbs

['Anne admired ',
 'Anne attended ',
 'Anne avoided ',
 'Anne conceived ',
 'Anne convinced ',
 'Anne delighted ',
 'Anne distinguished ',
 'Anne enquired ',
 'Anne entered ',
 'Anne felt\npersuaded,',
 'Anne followed ',
 'Anne hazarded ',
 'Anne hoped ',
 'Anne hoped ',
 'Anne included ',
 'Anne listened,',
 'Anne longed ',
 'Anne looked ',
 'Anne mentioned ',
 'Anne named ',
 'Anne offered ',
 'Anne presumed,',
 'Anne recollected ',
 'Anne restored ',
 'Anne sighed ',
 'Anne smiled ',
 'Anne smiled ',
 'Anne smiled,',
 'Anne struggled,',
 'Anne suppressed ',
 'Anne talked ',
 'Anne ventured ',
 'Anne ventured ',
 'Anne viewed ',
 'Anne walked ',
 'Anne walked ',
 'Anne walked ',
 'Anne wanted ',
 'Anne wondered ']

These aren't, of course, all of Anne's verbs. Regex only operates on sequences of characters.
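
A sketch on a made-up sample shows both kinds of error: the pattern misses an irregular verb like _went_, and it picks up _indeed_, which ends in _ed_ but is not a verb.

```python
import re

# made-up sample: one regular verb, one irregular verb, one non-verb ending in 'ed'
sample = "Anne walked home. Anne went out. Anne indeed smiled."

matches = re.findall(r"Anne [^ ]+ed\W", sample)
print(matches)
```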

We've now seen quite a lot of the regex syntax you'll ever need to find things with. To sum up:

```.``` any character

```+``` one or more of the preceding (by default, matches as much as possible: 'greedy')

```?``` one or none of the preceding

```*``` zero or more of the preceding (by default, matches as much as possible: 'greedy')

```[]``` a character class, 'find any of these, in any order'

```[^]``` a negated character class 'find anything that is not one of these'

What about if you want to find literal versions of the above, like a literal full stop?

Put a ```\``` in front of it to _escape_ it: make it not special. For example ```\?``` matches a literal question mark.
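
A sketch of the difference on a made-up string: the unescaped ```.``` matches any character after _Mr_, while the escaped one matches only a real full stop.

```python
import re

# made-up sample
sample = "Mr. Elliot met Mrs Clay"

# unescaped: . stands for any character, so 'Mrs' is found as well
unescaped = re.findall(r"Mr.", sample)
print(unescaped)

# escaped: \. stands for a literal full stop only
escaped = re.findall(r"Mr\.", sample)
print(escaped)
```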

#### some shortcuts

```\w``` any _word_ character: a letter, a digit or an underscore

```\W``` any non-word character, such as whitespace or punctuation

```[0-9]``` any digit

```[a-z]``` any lowercase letter

```[A-Z]``` any uppercase letter
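
A sketch of these shortcuts on a short fragment of the text. Note that ```\w``` matches letters, digits and the underscore, so the comma is left out of the words and turns up among the ```\W``` matches instead.

```python
import re

# a short fragment of the novel, typed in by hand for this sketch
sample = "Anne, born August 9"

word_runs = re.findall(r"\w+", sample)      # runs of word characters
non_word_runs = re.findall(r"\W+", sample)  # runs of everything else
digits = re.findall(r"[0-9]", sample)       # single digits
capitals = re.findall(r"[A-Z]", sample)     # single upper-case letters

print(word_runs)
print(non_word_runs)
print(digits)
print(capitals)
```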

But if you're new to regex this will still be a lot to take in. Practice is the only way to learn regex, so don't worry. The key thing is to remember that there are many situations where regex will make your life easier and you can look up the syntax any time you need to.

Last week, splitting on whitespace was too crude for us to get all the words from _Persuasion_. Regex allows us to fix that.

In [None]:
persuasion_words = re.findall(r'\w+', persuasion)

In [None]:
biggest_words = sorted(persuasion_words, key=len, reverse=True)

In [None]:
biggest_words[:10]

In [None]:
from collections import Counter


In [None]:
mycounts = Counter(persuasion_words)
mycounts.most_common(10) # or whatever number required
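
The same steps can be tried on a made-up sentence, as a sketch. One thing to watch for: an apostrophe is not a word character, so ```\w+``` splits _Anne’s_ into _Anne_ and _s_.

```python
import re
from collections import Counter

# made-up sample, using the same curly apostrophe as the novel
sample = "Anne smiled, and Anne’s sister smiled too."

words = re.findall(r"\w+", sample)
print(words)

counts = Counter(words)
print(counts.most_common(2))
```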

What about replacing? For that, in Python, we use ```re.sub```. It works the same way as ```findall``` but we need an extra argument to the function: the thing we want to put in place of what we found. As always, the simplest possible example is a good place to start.

In [55]:
sample = "Anne Elliot"
print(sample)
sample = re.sub(r"Anne? El+iot+", "the principal character", sample)
print(sample)

Anne Elliot
the principal character


The most powerful part of replacement is re-using parts of the find, for example to add to them or move them around.

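To do this, put round brackets around the part of the regex you want to recall in the replacement. This is known as a _capture group_.

In the replacement text the contents of the first set of brackets are referred to with ```\\1```, the second set with ```\\2``` and so on. In regex this is known as a _back reference_. The doubled backslash is needed because the replacement here is an ordinary Python string, in which a single ```\``` would itself be treated as an escape character.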

In [56]:
sample = "Anne Elliot"
print(sample)
print("But let's swap the names around:")
sample = re.sub(r"(Anne?) (El+iot+)", "\\2, \\1", sample)
print(sample)

Anne Elliot
But let's swap the names around:
Elliot, Anne
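
The replacement can also be written as a raw string, like the pattern, in which case a single backslash is enough. A minimal sketch comparing the two spellings:

```python
import re

sample = "Anne Elliot"

# ordinary string: the backslash has to be doubled
escaped = re.sub(r"(Anne?) (El+iot+)", "\\2, \\1", sample)

# raw string: a single backslash reaches the regex engine unchanged
raw = re.sub(r"(Anne?) (El+iot+)", r"\2, \1", sample)

print(escaped)
print(raw)
```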


#### when not to use regular expressions

Because regex works on strings, it cannot reliably _parse_ data, that is, work with its structure.

Once you get good at regex, you might be tempted to use it to parse structured data. Here's a simple example of data in _CSV_ (_comma-separated values_) format:
```
character,novel,occurrence_count
Anne Elliot,Persuasion,486
Emma Woodhouse,Emma,397
Elizabeth Bennet,Pride and Prejudice,292
Fanny Price,Mansfield Park,331
```

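If you try to extract the middle column with a pattern, you are using regex to parse the data. This is unreliable and inadvisable: a field may itself contain a comma, for example inside quotation marks, and a simple pattern cannot tell that comma from a separator.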
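
A sketch of how this can go wrong, and of one alternative: Python's standard ```csv``` module, which understands quoted fields. The data below is a hypothetical variant of the example, with a shortened header and the name written surname first so that the field contains a comma.

```python
import csv
import io
import re

# hypothetical data: the first field of the data row contains a comma inside quotes
data = 'character,novel,count\n"Elliot, Anne",Persuasion,486\n'

# a regex that grabs 'whatever sits between the first and second comma' on each line
regex_column = re.findall(r"^[^,]*,([^,]*),", data, flags=re.MULTILINE)
print(regex_column)  # the quoted comma has shifted the columns

# the csv module parses the structure, so the quoted comma is handled
reader = csv.DictReader(io.StringIO(data))
csv_column = [row["novel"] for row in reader]
print(csv_column)
```

Here ```io.StringIO``` only stands in for a file opened with ```open```, so that the sketch runs without a file on disk.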

### Group work

#### finding

1. Find the context around another main character in _Persuasion_, Captain Wentworth. 

2. Does Captain ever get abbreviated to _Capt._?

3. Find the word following _Anne_. Can you make a unique list of these? Can you take account of punctuation between _Anne_ and the following word?

4. Can you create an alphabetised list of all 9-letter words in _Persuasion_?

5. By default in Python, a ```.``` won't run past a ```\n``` character. Can you modify one of the above searches to include characters from the next line? You might need to look at the ```re``` [documentation](https://docs.python.org/3/library/re.html) for the answer to this.

#### replacing

Use ```re.sub``` to replace some text in ```persuasion```. If you work on the whole novel, Python will have no trouble with this, but it might be hard to see the results. You might prefer to create a slice of Persuasion of a few hundred characters, so you can see the output of your replacement more easily.

```re.sub``` returns a new string rather than changing the original. If you assign the result back to ```persuasion``` you will overwrite the novel's text, so you may prefer to store it under a different variable name, e.g.:

```modified_persuasion = re.sub(r"search string", "replacement", persuasion)```
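
A minimal sketch of that, on a made-up sentence rather than the novel: the result of ```re.sub``` goes into a new variable and the original string is left as it was.

```python
import re

# made-up sample standing in for a slice of the novel
original = "Anne Elliot was Anne to her friends."

# the result is a new string; 'original' itself is not changed
modified = re.sub(r"Anne", "Miss Anne", original)

print(original)
print(modified)
```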


#### finally

Can you explain why, above, we got slightly more results for ```persuasion.count("Anne")``` than for ```re.findall(r".{0,20}Anne.{0,20}", persuasion)```? This is a bit tricky! Maybe create a small text of your own to test the way these two behave.