# DIGI405 - Regular expressions - a super-quick introduction

## What are regular expressions?

Regular expressions are sequences of characters that are used to search for a set of strings in a larger text.

## Where can I use them?

In this notebook we are working with Python, but regular expressions can also be used in other programming languages, text editors and command-line tools.

## Why are they useful for text analysis applications?

You can use them in pre-processing to clean your data (e.g. removing menus or headers from a web page you have scraped). You can use them to normalise your text (e.g. replacing the different English renderings of a Russian name like Евгений, such as Yevgeny, Yevgeniy, Evgeny, Evgeni, Evgeniy and Eugeny, with a single standardised form). You can also use them to tokenise text or to extract features for analysis or modelling.
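As a rough sketch of the normalisation idea, the cell below uses `re.sub` to replace each spelling listed above with one form. The sentence is made up for illustration, and choosing `Yevgeny` as the standard form is an assumption. The pattern syntax (`\b`, `(`, `)` and `|`) is explained later in this notebook.

```python
import re

# Hypothetical sentence, invented for this illustration
text = 'Evgeni met Yevgeniy and Eugeny, but Evgeniy and Evgeny stayed home with Yevgeny.'

# Any one of the six spellings, as a whole word
variants = r'\b(Yevgeny|Yevgeniy|Evgeny|Evgeni|Evgeniy|Eugeny)\b'

# re.sub replaces every match of the pattern with the replacement string
normalised = re.sub(variants, 'Yevgeny', text)
print(normalised)
```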

## Super-quick introduction through examples

This notebook will take you through some examples of using regular expressions in Python. It is intended to be a brief introduction. You will find lots of opportunities to put regular expressions to use in collecting and preparing texts for analysis.

We first need to import Python's library for regular expressions. You can find documentation here: https://docs.python.org/3/library/re.html and you will be able to find lots of general tutorials on all the features of the `re` library.

In [None]:
import re

As example text we will use the opening portion of _Pride and Prejudice_ by Jane Austen. This text was retrieved from https://www.gutenberg.org/ebooks/42671.txt.utf-8.

In [None]:
sample = '''It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered as the rightful
property of some one or other of their daughters.

"My dear Mr. Bennet," said his lady to him one day, "have you heard that
Netherfield Park is let at last?"

Mr. Bennet replied that he had not.

"But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it."

Mr. Bennet made no answer.

"Do not you want to know who has taken it?" cried his wife impatiently.

"_You_ want to tell me, and I have no objection to hearing it."

This was invitation enough. 
'''

### Match a sequence of characters

In the first example below we use the `re.findall` function to find every mention of `Bennet` and print a total count of matches. Change the value of `pattern` to try some other text you want to search for (e.g. another word).

In [None]:
pattern = 'Bennet' # set the regular expression
result = re.findall(pattern, sample) # this finds all instances of the pattern in sample and returns a list of strings
print(result) # print the list of results
print('Total matches:',len(result)) # print a count of the number of results

### Case insensitive regular expressions

If you search for `this` using the code above you will only get one match even though there are two instances of `this`. This is because the search is case sensitive. If you want a case insensitive search, pass the `re.IGNORECASE` flag as in the example below. Try changing the pattern to other text you want to search for.

In [None]:
pattern = 'this'
result = re.findall(pattern, sample, re.IGNORECASE)
print(result)
print('Total matches:',len(result))

### Matching words

If you search for `he` using the code above you get 14 matches. What is the problem with this search? Have a think about it and look at the sample from _Pride and Prejudice_ above and then read on.

There are only two instances of the word `he`. However, there are lots of instances of `he` inside words like `the` or `she`. The pattern below matches only distinct instances of the word `he`. This regular expression contains two instances of `\b`, which matches a word boundary: the position between a word character and a non-word character such as a space or punctuation mark, or the start or end of the text.

In [None]:
pattern = r'\bhe\b' # in the rest of the examples the pattern is defined as a raw string - see https://tinyurl.com/nhrarzv
result = re.findall(pattern, sample, re.IGNORECASE)
print(result)
print('Total matches:',len(result))
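To illustrate why the pattern above is defined as a raw string, here is a small sketch (the variable names are only for illustration). In an ordinary Python string literal `'\b'` is the backspace character, so the `re` module never sees a word boundary:

```python
import re

text = 'that he is considered'  # a fragment of the sample text

plain = '\bhe\b'   # ordinary string: each \b becomes a single backspace character
raw = r'\bhe\b'    # raw string: the backslashes are passed on to the re module

print(len(plain), len(raw))  # 4 6

plain_result = re.findall(plain, text)  # searches for backspace + 'he' + backspace
raw_result = re.findall(raw, text)      # searches for the whole word 'he'
print(plain_result)  # []
print(raw_result)    # ['he']
```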

The pattern below includes two new features. Firstly, `(` and `)` mark the start and end of a group, which specifies the part(s) of the regular expression we want to match. Secondly, `|` allows us to specify alternatives. In this case, the regular expression matches the pronouns `he` or `she`. Modify this regular expression to search for the words `he`, `she`, `his`, `her`.

In [None]:
pattern = r'\b(he|she)\b'
result = re.findall(pattern, sample, re.IGNORECASE)
print(result)
print('Total matches:',len(result))
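The group matters here because `|` separates everything to its left from everything to its right. The sketch below, using a made-up sentence, compares the pattern above with a version that leaves out the parentheses. Without them the pattern reads as "`\bhe` or `she\b`", so the `he` at the start of `heard` is also matched.

```python
import re

text = 'She said that he heard the others.'  # hypothetical sentence

with_group = r'\b(he|she)\b'   # boundary, then he or she, then boundary
without_group = r'\bhe|she\b'  # either \bhe, or she\b

grouped = re.findall(with_group, text, re.IGNORECASE)
ungrouped = re.findall(without_group, text, re.IGNORECASE)

print(grouped)    # ['She', 'he']
print(ungrouped)  # ['She', 'he', 'he']
```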

### Matching a set of characters

We can match a set of characters using `[` and `]`. In the following example I have removed the `re.IGNORECASE`, because we want to match precisely based on case. Replace the `[a]` with the following patterns to see the effects:   
`[abc]`  
`[a-z]`  
`[A-Z]`

In [None]:
pattern = r'[a]'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))

### Matching multiple characters

We are rarely interested in single characters like this; more often we want characters in combination. In this example we will match `[a-zA-Z]`, i.e. the set of characters in the range A to Z and a to z.

There are a number of ways of specifying repeated patterns. For example, the `{3}` in the example below controls the number of `[a-zA-Z]` characters that will be matched. Combined with the `\b` word boundaries, this example matches all the 3-letter words. Replace the `{3}` with the following patterns to see the effects:  
`{4}`  
`{3,4}`  
`{5,}`    

In [None]:
pattern = r'\b[a-zA-Z]{3}\b'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))

In the next example we use a regular expression to extract years from a different sample text by matching 4-digit numbers.

In [None]:
# Pride and Prejudice's intro doesn't have numbers - this text from https://en.wikipedia.org/wiki/New_Zealand
numbers_sample = '''Sometime between 1250 and 1300, Polynesians settled in the islands that later were named 
New Zealand and developed a distinctive Māori culture.'''

pattern = r'[0-9]{4}' # this could also be written as \d{4} - \d matches digits
result = re.findall(pattern, numbers_sample)
print(result)
print('Total matches:',len(result))
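One caveat: `[0-9]{4}` on its own also matches the first four digits of a longer number. A possible refinement, sketched here on a made-up sentence, is to add the `\b` word boundaries introduced earlier:

```python
import re

text = 'Order 123456 was placed in 1999.'  # hypothetical sentence

loose = re.findall(r'[0-9]{4}', text)       # any run of four digits
strict = re.findall(r'\b[0-9]{4}\b', text)  # exactly four digits as a whole token

print(loose)   # ['1234', '1999']
print(strict)  # ['1999']
```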

There are other ways of specifying repetitions. Each applies to the item immediately before it (a character, a set or a group):  
`+` matches 1 or more repetitions  
`*` matches 0 or more repetitions  
`?` matches 0 or 1 repetitions  

For example, the following pattern matches words starting with an upper-case character. The regular expression looks for 1 or more upper-case characters followed by 0 or more lower-case characters. Why "0 or more"? A word can consist of a single upper-case character, e.g. `I`.

In [None]:
pattern = r'\b[A-Z]+[a-z]*\b'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))
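The three repetition operators `+`, `*` and `?` can be compared side by side on a short made-up string (the string and patterns are illustrative assumptions). In each pattern the operator applies to the letter `o` directly before it:

```python
import re

text = 'gd god good goood'  # hypothetical test string

plus = re.findall(r'go+d', text)      # g, one or more o, d
star = re.findall(r'go*d', text)      # g, zero or more o, d
optional = re.findall(r'go?d', text)  # g, zero or one o, d

print(plus)      # ['god', 'good', 'goood']
print(star)      # ['gd', 'god', 'good', 'goood']
print(optional)  # ['gd', 'god']
```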

Here we match any word ending in `ing`:

In [None]:
pattern = r'\b[\w]+ing\b'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))

### Tokenising text

How do we tokenise a text? You are probably aware from the examples in the lecture that strings can be tokenised using regular expressions. There are multiple ways we can do this. In the following example we retrieve tokens by matching sequences of one or more `\w` characters. `\w` matches any word character (letters, digits and the underscore).

In [None]:
pattern = r'\w+'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))

Why are we using `\w` instead of `[A-Za-z]`? It is safer for Unicode strings. Think about the word _Māori_, which contains a macron. The cell below shows what happens when we use `[A-Za-z]` to tokenise a simple string.

In [None]:
sample_with_macrons = 'Māori are the tangata whenua – the people of the land. '
pattern = r'[A-Za-z]+'
result = re.findall(pattern, sample_with_macrons)
print('Tokenising based on',pattern)
print(result)
print('Total matches:',len(result))

print()

pattern = r'\w+'
result = re.findall(pattern, sample_with_macrons)
print('Tokenising based on',pattern)
print(result)
print('Total matches:',len(result))

Remember: there is more than one way to apply regular expressions to tokenise text. Using `\w+`, the dollar amount in the following sentence would be split into multiple tokens: "The coffee cost $5.50!". Take a moment to search Google for "tokenise text regular expression" to find some different patterns you can use to tokenise text, and try them in the next cell. Change the `\w+` pattern to whatever you find.

In [None]:
pattern = r'\w+'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))
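To make the `$5.50` problem concrete, the sketch below tokenises that sentence twice. The second pattern is only one possible alternative, chosen for illustration, and is not the only or best answer. It tries a dollar amount first and falls back to `\w+`; the `(?:...)` syntax is a group that `re.findall` does not return separately.

```python
import re

sentence = 'The coffee cost $5.50!'

simple_tokens = re.findall(r'\w+', sentence)
print(simple_tokens)  # ['The', 'coffee', 'cost', '5', '50']

# Optional $, digits, optional decimal part; otherwise a run of word characters
money_pattern = r'\$?\d+(?:\.\d+)?|\w+'
money_tokens = re.findall(money_pattern, sentence)
print(money_tokens)   # ['The', 'coffee', 'cost', '$5.50']
```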

### A more complex application: extracting direct speech

The following example does something more complex: extracting direct speech from the introduction to _Pride and Prejudice_. 

There is a lot packed into this short regular expression.

This example introduces one function of the `^` character in regular expressions. When `^` is used as the first character in a set of characters, all the characters not in the set will be matched (e.g. `[^a-z]` matches any character other than those in the a to z range). 

Here we are matching any sequence of characters that: 
1. `"` begins with a `"` character to indicate the start of the direct speech;
2. `[^"]*` is followed by 0 or more instances of any character that is _not_ `"` ; and, 
3. `"` ends with a `"` character that indicates the end of the direct speech. 

In [None]:
pattern = r'"[^"]*"'
result = re.findall(pattern, sample)
print(result)
print('Total matches:',len(result))
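The matches above include the surrounding `"` characters. As a small variation, offered as a sketch rather than the only approach, wrapping `[^"]*` in a group makes `re.findall` return only the text between the quotation marks. The string below is a shortened fragment of the sample.

```python
import re

text = '"But it is," returned she; "for Mrs. Long has just been here."'

# The group ( ) marks the part of each match that findall should return
speech = re.findall(r'"([^"]*)"', text)
print(speech)  # ['But it is,', 'for Mrs. Long has just been here.']
```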

### Here are some more resources to help you learn about regular expressions

This notebook does not cover all the functionality of regular expressions. Take some time to look at the documentation for the [Python re library](https://docs.python.org/3/library/re.html), the reading for [week 5 on regular expressions](https://automatetheboringstuff.com/chapter7/) and sites that allow you to test regular expressions in your browser. For example, https://regex101.com/ has a great interface and has reference material as well (switch the flavour to Python).